Abstract: We present here a PAC-Bayesian point of view on adaptive
supervised classification. Using convex analysis on the set of posterior prob- 
ability measures on the parameter space, we show how to get local measures 
of the complexity of the classification model involving the relative entropy 
of posterior distributions with respect to Gibbs posterior measures. We then 
discuss relative bounds, comparing the generalization error of two classifica- 
tion rules, showing how the margin assumption of Mammen and Tsybakov 
can be replaced with some empirical measure of the covariance structure 
of the classification model. We also show how to associate to any posterior 
distribution an effective temperature relating it to the Gibbs prior distribu- 
tion with the same level of expected error rate, and how to estimate this 
effective temperature from data, resulting in an estimator whose expected 
error rate converges according to the best possible power of the sample size 
adaptively under any margin and parametric complexity assumptions. Then 
we introduce a PAC-Bayesian point of view on transductive learning and use 
it to improve on known Vapnik's generalization bounds, extending them to 
the case when the sample is made of independent non identically distributed 
pairs of patterns and labels. Finally, we briefly review the construction
of Support Vector Machines and show how to derive generalization bounds 
for them, measuring the complexity either through the number of support 
vectors or through transductive or inductive margin estimates. 
Introduction

Among the possible approaches to pattern recognition, statistical learning
theory has received a lot of attention in the last few years. Although a
realistic pattern recognition scheme involves data pre-processing and post- 
processing that need a theory of their own, a central role is often played by 
some kind of supervised learning algorithm. This central piece of work is the 
subject we are going to analyse in these notes. 

Accordingly, we assume that we have prepared in some way or another a
sample of $N$ labelled patterns $(X_i, Y_i)_{i=1}^N$, where $X_i$ ranges in some
pattern space $\mathcal{X}$ and $Y_i$ ranges in some finite label set $\mathcal{Y}$. We also
assume that we have devised our experiment in such a way that the couples of
random variables $(X_i, Y_i)$ are independent (but not necessarily
equidistributed). Here, randomness should be understood to come from the way
the statistician has planned his experiment. He may for instance have drawn
the $X_i$s at random from some larger population of patterns the algorithm is
meant to be applied to in a second stage. The labels $Y_i$ may have been set
with the help of some external expertise (which may itself be faulty or
contain some amount of randomness, therefore we do not assume that $Y_i$ is a
function of $X_i$, and allow the couple of random variables $(X_i, Y_i)$ to
follow any kind of joint distribution). In practice, patterns will be
extracted from some high dimensional and highly structured data, like digital
images, speech signals, DNA sequences, etc. We will not discuss here this
pre-processing stage (although it poses crucial problems dealing with
segmentation and the choice of a representation).

To fix notations, let $(X_i, Y_i)_{i=1}^N$ be the canonical process on
$\Omega = (\mathcal{X} \times \mathcal{Y})^N$ (which means the coordinate process). Let the
pattern space be provided with a sigma-algebra $\mathcal{B}$ turning it into a
measurable space $(\mathcal{X}, \mathcal{B})$. On the finite label space $\mathcal{Y}$, we will
consider the trivial algebra $\mathcal{B}'$ made of all its subsets. Let
$\mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$ be our notation
for the set of probability measures (i.e. of positive measures of total mass
equal to 1) on the measurable space
$\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$. Once some probability
distribution $\mathbb{P} \in \mathcal{M}_+^1\bigl[(\mathcal{X} \times \mathcal{Y})^N, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr]$ is
chosen, it turns $(X_i, Y_i)_{i=1}^N$ into the canonical realization of a
stochastic process modeling the observed sample (also called the training
set). We will assume that $\mathbb{P} = \bigotimes_{i=1}^N P_i$, where for each
$i = 1, \dots, N$, $P_i \in \mathcal{M}_+^1(\mathcal{X} \times \mathcal{Y}, \mathcal{B} \otimes \mathcal{B}')$, to reflect the
assumption that we observe independent pairs of patterns and labels. We will
also assume that we are provided with some indexed set of possible
classification rules

$$\mathcal{R}_\Theta = \bigl\{ f_\theta : \mathcal{X} \to \mathcal{Y} ;\ \theta \in \Theta \bigr\},$$

where $(\Theta, \mathcal{T})$ is some measurable index set. Assuming some indexation of
the classification rules is just a matter of presentation. Although it leads
to longer notations, it allows us to integrate over the space of
classification rules as well as over $\Omega$ using the usual formalism of
multiple integrals. For this purpose, we will assume that
$(\theta, x) \mapsto f_\theta(x) : (\Theta \times \mathcal{X}, \mathcal{T} \otimes \mathcal{B}) \to (\mathcal{Y}, \mathcal{B}')$ is a
measurable function.

In many cases $\Theta = \bigcup_{m \in M} \Theta_m$ will be a finite (or more generally
countable) union of subspaces, dividing the classification model
$\mathcal{R}_\Theta = \bigcup_{m \in M} \mathcal{R}_{\Theta_m}$ into
a union of submodels. The importance of introducing such a structure has
been put forward by V. Vapnik, as a way to avoid making strong hypothe- 
ses on the distribution P of the sample. If neither the distribution of the 
sample nor the set of classification rules were constrained, it is well known 
indeed that no kind of statistical inference would be possible. Considering 
a family of submodels is a way to provide for adaptive classification where 
the choice of the model depends on the observed sample. Restricting the 
set of classification rules is more realistic than restricting the distribution of 
patterns, since the classification rules are a processing tool left to the choice 
of the statistician, whereas the distribution of the patterns is not fully under 
his control, except for some planning of the learning experiment which may 
enforce some weak properties like independence, but not the precise shapes 
of the marginal distributions Pi which are as a rule unknown distributions 
on some high dimensional space. 

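In these notes, we will concentrate on general issues concerned with a
natural measure of risk, namely the expected error rate of each classification
rule $f_\theta$, expressed as

$$R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{P}\bigl[ f_\theta(X_i) \neq Y_i \bigr].$$

As this quantity is unobserved, we will be led to work with the corresponding
empirical error rate

$$r(\theta) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl[ f_\theta(X_i) \neq Y_i \bigr].$$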

This does not mean that practical learning algorithms will always try to
minimize this criterion. On the contrary, they often try to minimize some
other criterion which is linked with the structure of the problem and has
some nice additional properties (like smoothness and convexity, for
example). Nevertheless, and independently from the precise form of the
estimator $\hat\theta : \Omega \to \Theta$ under study, the analysis of $R(\hat\theta)$ is a
natural question, and often corresponds to what is required in practice.

Answering this question is not straightforward because, although $R(\theta)$ is
the expectation of $r(\theta)$, a sum of independent Bernoulli random variables,
$R(\hat\theta)$ is not the expectation of $r(\hat\theta)$, because of the dependence of
$\hat\theta$ on the sample, and neither is $r(\hat\theta)$ a sum of independent random
variables. To circumvent this unfortunate situation, some uniform control
over the deviations of $r$ with respect to $R$ is needed.
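The gap between $r(\hat\theta)$ and $R(\hat\theta)$ can be seen on a toy simulation. The sketch below is only an illustration under assumptions that are not in the text: labels are pure noise, so that every fixed rule has expected error rate 1/2, and the errors of the different rules are modelled as independent coin flips. The function name `erm_optimism` and all parameter values are hypothetical.

```python
import numpy as np

def erm_optimism(n_samples=50, n_rules=50, n_trials=200, seed=0):
    """Average empirical error of the empirical risk minimizer on noise labels.

    Toy assumption: errors[i, k] = 1 when rule k misclassifies example i, and
    these indicators are independent fair coin flips, so R(theta) = 0.5 for
    every rule, including the data-dependent one that is finally selected.
    """
    rng = np.random.default_rng(seed)
    min_empirical = []
    for _ in range(n_trials):
        errors = rng.integers(0, 2, size=(n_samples, n_rules))
        r = errors.mean(axis=0)        # empirical error rates r(theta)
        min_empirical.append(r.min())  # r(theta_hat) for the minimizing rule
    return float(np.mean(min_empirical)), 0.5

mean_r_hat, true_risk = erm_optimism()
print(f"mean r(theta_hat) = {mean_r_hat:.3f}, R(theta_hat) = {true_risk}")
```

In this toy setting the selected rule looks much better on the sample than it really is, which is the reason why a control that holds uniformly over the parameter is required.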

The PAC-Bayesian approach to this problem, which originated in the machine
learning community and was pioneered by D. McAllester [25, 26], can be seen
as some variant of the more classical approach of M-estimators relying on
empirical process theory (as exposed for instance in [36]).

It is built on three cornerstones:

• One idea is to embed the set of estimators of the type $\hat\theta : \Omega \to \Theta$
into the larger set of regular conditional probability measures
$\rho : \bigl(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr) \to \mathcal{M}_+^1(\Theta, \mathcal{T})$. We will call these
conditional probability measures posterior distributions, to follow a usual
terminology.

• A second idea is to measure the fluctuations of $\rho$ with respect to the
sample, using some prior distribution $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$, and the Kullback
divergence function $\mathcal{K}(\rho, \pi)$. The expectation $\mathbb{P}\{\mathcal{K}(\rho, \pi)\}$ measures the
randomness of $\rho$. The optimal choice of $\pi$ would be $\mathbb{P}(\rho)$, resulting in
a measure of the randomness of $\rho$ equal to the mutual information
between the sample and the estimated parameter drawn from $\rho$. However,
since $\mathbb{P}(\rho)$ is as a rule no more observed than $\mathbb{P}$, we will have
to be content with some less concentrated prior distribution $\pi$, resulting
in some looser measure of randomness, as shown by the identity
$\mathbb{P}[\mathcal{K}(\rho, \pi)] = \mathbb{P}\{\mathcal{K}[\rho, \mathbb{P}(\rho)]\} + \mathcal{K}[\mathbb{P}(\rho), \pi]$.

• A third idea is to analyze the fluctuations of the random process
$\theta \mapsto r(\theta)$ with respect to its mean process $\theta \mapsto R(\theta)$ through the
log-Laplace transform

$$-\frac{1}{\lambda} \log \Bigl\{ \iint \exp\bigl[ -\lambda r(\theta, \omega) \bigr] \pi(d\theta) \mathbb{P}(d\omega) \Bigr\},$$

as a physicist prone to statistical mechanics (where this is called the
free energy) would do. This transform is well suited to relate
$\min_{\theta \in \Theta} r(\theta)$ to $\inf_{\theta \in \Theta} R(\theta)$.

This monograph is divided into two sections. The first one deals with the
inductive setting presented in these lines, the second one with the
transductive setting, where, following Vapnik's seminal approach [^Tj, a shadow
sample is considered.

Olivier Catoni, May 28, 2006

In the first section, two types of bounds are shown. Empirical bounds can
be used to choose between estimators or to build estimators. Non random
bounds can be used to assess the speed of convergence of estimators, relating
this speed to the speed of convergence of the Gibbs prior expected error
rate $\beta \mapsto \pi_{\exp(-\beta R)}(R)$ towards $\operatorname{ess\,inf}_\pi R$ as $\beta$ goes to infinity,
and to other quantities akin to the margin assumption of Mammen and Tsybakov
in more sophisticated cases. We will progress from the most straightforward
bounds to more elaborate ones, built to achieve a better asymptotic
behaviour. We will thus introduce local bounds and relative bounds. From an
asymptotic point of view, the culminating result of these notes is Theorem
1.39. It is used in Proposition 1.40 to build a classification rule which is
proved to be adaptive in all the parameters of the Mammen and Tsybakov
margin assumption and of a parametric complexity assumption in Corollary
1.52 of Theorem 1.50. This opens the road to Theorem 1.59, which performs
two step localization on top of Theorem EMI
in order to be able to achieve adaptive model selection with a decreased
influence of the number of empirically inefficient models included in the
comparison. The analysis of this bound is hinted at in subsequent pages,
but not fully developed, since we are not sure the amount of technicalities
it requires is worth it. We would not like to lead the reader into
thinking that each result in the first section is actually an improvement on
the previous one: it is as a rule only an asymptotic improvement, and the
price to pay for being asymptotically tighter is to get looser bounds for
small sample sizes. What is a small sample size in practice is a question of
ratio between the number of examples and the complexity (roughly speaking
the number of parameters) of the model used to classify. Since our aim
here is to describe classification methods suitable for complex data (images,
speech, DNA, ...), we suspect that practitioners wanting to make use of
our proposals will be confronted with small sample sizes more often than
with large ones, and should try to make use of the simplest bounds first
and see only afterwards whether the asymptotically better ones can bring
them more for the size of samples their computers can handle and their
data bases can provide. Let us also argue that the results of this first
section are not only of a theoretical nature, for two reasons. The first one
is that posterior parameter distributions can be computed effectively, using
Monte Carlo techniques. There is a whole tradition about these computations
in Bayesian statistics, proving that what we call here Gibbs estimators are
not only a way to show that some optimal speeds of convergence can be
reached in some theoretically well understood situations, but that they can
also be computed in practice. The second reason is that a traditional non
randomized estimator $\hat\theta \in \Theta$ of the parameter can be approximated by a
posterior distribution $\rho$ which is supported by a fairly narrow neighbourhood
of $\hat\theta \in \Theta$, without spoiling our bounds excessively, resulting in a
classification rule which provides a randomized answer only for a small
number of dubious examples and will most of the time issue the same
deterministic answer as the classification rule indexed by $\hat\theta$ it is derived
from. This is explained later in these notes.

In the second section, we show first how we can transport all the results
obtained in the inductive case to the transductive case, allowing us to
replace prior distributions by partially exchangeable posterior
distributions depending on an extended sample where unlabelled shadow
examples are added, with increased possibilities of adaptation to the data.
We then focus on the small sample case, where local and relative bounds are
not expected to be of great help. Using a fictitious (that is unobserved)
shadow sample, we study Vapnik type generalization bounds, showing how to
tighten and extend them using some original ideas, like making no Gaussian
approximation to the log-Laplace transform of Bernoulli random variables,
using a shadow sample of arbitrary size, refraining from the use of any
symmetrization trick, and using a subset of the group of permutations
suitable to cover the case of independent non identically distributed data.
The culminating result of the second section is Theorem 2.17, subsequent
bounds showing the separate influence of the above ideas and providing an
easier comparison with Vapnik's original results. Vapnik type generalization
bounds have a broad applicability, not only through the concept of VC
dimension, but also through the use of compression schemes [21], which are
briefly described later on.

1. Inductive PAC-Bayesian learning 

The setting of inductive inference (as opposed to transductive inference 
to be discussed later) is the one described in the introduction. 

When we have to take the expectation of a random variable
$Z : \Omega \to \mathbb{R}$ as well as of a function of the parameter $h : \Theta \to \mathbb{R}$ with
respect to some probability measure, we will as a rule use functional short
notations instead of resorting to the integral sign: thus we will write
$\mathbb{P}(Z)$ for $\int_\Omega Z(\omega) \mathbb{P}(d\omega)$ and $\pi(h)$ for $\int_\Theta h(\theta) \pi(d\theta)$.

The PAC-Bayesian approach, in its simplest form, relies on some basic
upper bound for the Laplace transform of
$\sup_{\rho \in \mathcal{M}_+^1(\Theta)} [\rho(R) - \rho(r)]$, or more
technically on some penalized variant of it, as will be seen. This will be
the subject of the next subsection, where we will start with the Laplace
transform of $R(\theta) - r(\theta)$, for any $\theta \in \Theta$, before encompassing posterior
distributions. As it is already easy to guess, the purpose of these
preliminaries is to gain some uniform control on the lower deviations of the
empirical error rate from the expected error rate under any posterior
distribution.

1.1. Basic inequality. In the setting described in the introduction, let
us consider the Bernoulli random variables
$\sigma_i(\theta) = \mathbb{1}[Y_i \neq f_\theta(X_i)]$. Using independence and the concavity of
the logarithm function, it is readily seen that for any real constant
$\lambda$

$$\log\bigl\{\mathbb{P}\{\exp[-\lambda r(\theta)]\}\bigr\}
= \sum_{i=1}^N \log\bigl\{\mathbb{P}\bigl[\exp\bigl(-\tfrac{\lambda}{N}\sigma_i\bigr)\bigr]\bigr\}
\le N \log\Bigl\{\frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[\exp\bigl(-\tfrac{\lambda}{N}\sigma_i\bigr)\bigr]\Bigr\}.$$

The right-hand side of this inequality is the log Laplace transform of a
Bernoulli distribution with parameter
$\frac{1}{N}\sum_{i=1}^N \mathbb{P}(\sigma_i) = R(\theta)$. As any Bernoulli
distribution is fully defined by its parameter, this log Laplace transform is
necessarily a function of $R(\theta)$. It can be expressed with the help of the
family of functions

$$\Phi_a(p) = -a^{-1} \log\bigl\{1 - [1 - \exp(-a)]\,p\bigr\}, \qquad a \in \mathbb{R},\ p \in (0, 1).$$

It is immediately seen that $\Phi_a$ is an increasing one to one mapping of the
unit interval onto itself, and that it is convex when $a > 0$, concave when
$a < 0$ and can be defined by continuity to be the identity when $a = 0$.
Moreover the inverse of $\Phi_a$ is given by the formula

$$\Phi_a^{-1}(q) = \frac{1 - \exp(-aq)}{1 - \exp(-a)}, \qquad q \in (0, 1).$$

This formula may be used to extend $\Phi_a^{-1}$ to $q \in \mathbb{R}$, and we will use this
extension without further notice when required.
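As a quick sanity check, the family $\Phi_a$ can be coded directly. In the sketch below, the inverse is obtained by solving $\Phi_a(p) = q$ for $p$, which gives $p = [1 - \exp(-aq)] / [1 - \exp(-a)]$; the function names `phi` and `phi_inv` are ours and purely illustrative.

```python
import math

def phi(a: float, p: float) -> float:
    """Phi_a(p) = -log(1 - (1 - exp(-a)) * p) / a, with Phi_0 the identity."""
    if a == 0.0:
        return p
    # -expm1(-a) computes 1 - exp(-a) accurately for small a.
    return -math.log(1.0 - (-math.expm1(-a)) * p) / a

def phi_inv(a: float, q: float) -> float:
    """Inverse map, obtained by solving Phi_a(p) = q for p."""
    if a == 0.0:
        return q
    return math.expm1(-a * q) / math.expm1(-a)

# Round trip and shape checks on a few values.
for a in (-2.0, 0.5, 3.0):
    for p in (0.1, 0.5, 0.9):
        assert abs(phi_inv(a, phi(a, p)) - p) < 1e-12
print(phi(2.0, 0.5), phi(-2.0, 0.5))
```

The two printed values lie respectively below and above 0.5, in line with the convexity of $\Phi_a$ for $a > 0$ and its concavity for $a < 0$, both maps fixing the end points 0 and 1.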

Using these notations, the previous inequality becomes

$$\log\bigl\{\mathbb{P}\{\exp[-\lambda r(\theta)]\}\bigr\} \le -\lambda \Phi_{\lambda/N}[R(\theta)],$$

proving

Lemma 1.1. For any real constant $\lambda$ and any parameter $\theta \in \Theta$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\lambda\bigl\{\Phi_{\lambda/N}[R(\theta)] - r(\theta)\bigr\}\Bigr]\Bigr\} \le 1.$$



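In previous versions of this study, we had used some Bernstein bound
instead of this lemma. As it will turn out, keeping the log Laplace
transform of a Bernoulli instead of approximating it provides simpler and
tighter results. Lemma 1.1 implies that for any constants $\lambda \in \mathbb{R}_+$ and
$\epsilon \in (0, 1)$,

$$\mathbb{P}\Bigl\{\Phi_{\lambda/N}[R(\theta)] + \frac{\log(\epsilon)}{\lambda} \le r(\theta)\Bigr\} \ge 1 - \epsilon.$$

Choosing $\lambda \in \arg\max_{\mathbb{R}_+} \Phi_{\lambda/N}[R(\theta)] + \frac{\log(\epsilon)}{\lambda}$, we deduce

Lemma 1.2. For any $\epsilon \in (0, 1)$, any $\theta \in \Theta$,

$$\mathbb{P}\Bigl\{R(\theta) \le \inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\Bigl[r(\theta) - \frac{\log(\epsilon)}{\lambda}\Bigr]\Bigr\} \ge 1 - \epsilon.$$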



We will illustrate throughout these notes the bounds we prove with a small
numerical example: in the case where $N = 1000$, $\epsilon = 0.01$ and
$r(\theta) = 0.2$, we get with a confidence level of 0.99 that $R(\theta) \le 0.2402$,
this being obtained for $\lambda = 234$.
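This numerical example can be reproduced with a few lines of code. The sketch below assumes the bound of Lemma 1.2 in the form $\inf_\lambda \Phi_{\lambda/N}^{-1}[r(\theta) - \log(\epsilon)/\lambda]$ with $\Phi_a^{-1}(q) = [1 - \exp(-aq)]/[1 - \exp(-a)]$, and replaces the infimum by a search over integer values of $\lambda$; the helper names are hypothetical.

```python
import math

def phi_inv(a: float, q: float) -> float:
    """Phi_a^{-1}(q) = (1 - exp(-a q)) / (1 - exp(-a)), for a != 0."""
    return math.expm1(-a * q) / math.expm1(-a)

def lemma_1_2_bound(r: float, n: int, eps: float, lambdas):
    """Smallest value of Phi_{lam/n}^{-1}(r - log(eps)/lam) over a grid of lam."""
    def bound(lam):
        return phi_inv(lam / n, r - math.log(eps) / lam)
    best = min(lambdas, key=bound)
    return bound(best), best

# Integer grid for lambda: a simplification of the infimum over R_+.
bound, lam = lemma_1_2_bound(r=0.2, n=1000, eps=0.01, lambdas=range(1, 1001))
print(f"R(theta) <= {bound:.4f}, reached for lambda = {lam}")
```

The objective is quite flat around its minimum, so a coarse grid is enough here; the values found should land close to the figures quoted above.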

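Now, to proceed towards the analysis of posterior distributions, let us put
for short

$$U_\lambda(\theta, \omega) = \lambda\bigl\{\Phi_{\lambda/N}[R(\theta)] - r(\theta, \omega)\bigr\}$$

and let us consider $\log\bigl\{\mathbb{P}\,\pi[\exp(U_\lambda)]\bigr\}$, where $\pi \in \mathcal{M}_+^1(\Theta, \mathcal{T})$ is
some prior probability measure on the parameter space. Using Fubini's
theorem for non negative functions, we see that

$$\log\bigl\{\mathbb{P}\bigl[\pi[\exp(U_\lambda)]\bigr]\bigr\} = \log\bigl\{\pi\bigl[\mathbb{P}[\exp(U_\lambda)]\bigr]\bigr\} \le 0.$$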

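To relate this quantity to the expectation $\rho(U_\lambda)$ with respect to any
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, we will use the properties of
the Kullback divergence $\mathcal{K}(\rho, \pi)$ of $\rho$ with respect to $\pi$, which is
defined as

$$\mathcal{K}(\rho, \pi) = \begin{cases} \displaystyle\int \log\Bigl(\frac{d\rho}{d\pi}\Bigr)\, d\rho, & \text{when } \rho \ll \pi, \\ +\infty, & \text{otherwise.} \end{cases}$$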



The following lemma shows in which sense the Kullback divergence function 
can be thought of as the dual of the log Laplace transform. 
Lemma 1.3. For any bounded measurable function $h : \Theta \to \mathbb{R}$, and any
probability distribution $\rho \in \mathcal{M}_+^1(\Theta)$ such that $\mathcal{K}(\rho, \pi) < \infty$,

$$\log\bigl\{\pi[\exp(h)]\bigr\} = \rho(h) - \mathcal{K}(\rho, \pi) + \mathcal{K}(\rho, \pi_{\exp(h)}),$$

where by definition $\dfrac{d\pi_{\exp(h)}}{d\pi} = \dfrac{\exp[h(\theta)]}{\pi[\exp(h)]}$. Consequently

$$\log\bigl\{\pi[\exp(h)]\bigr\} = \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho(h) - \mathcal{K}(\rho, \pi).$$

The proof is just a matter of writing down the definition of the quantities
involved and using the fact that the Kullback divergence function is non
negative. It can be found in [17, page 160]. In the duality between measurable
functions and probability measures, we thus see that the log Laplace
transform with respect to $\pi$ is the Legendre transform of the Kullback
divergence function with respect to $\pi$. Using this, we get

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho[U_\lambda(\theta)] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\} \le 1,$$

which, combined with the convexity of $\lambda \Phi_{\lambda/N}$, proves the basic inequality
we were looking for.
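The duality stated in Lemma 1.3 is easy to check numerically when the parameter space is finite. The following sketch is only an illustration: the six-point parameter space, the random prior, the function `h` and the helper names `kl` and `gibbs` are all hypothetical choices.

```python
import numpy as np

def kl(rho: np.ndarray, pi: np.ndarray) -> float:
    """Kullback divergence K(rho, pi) for discrete distributions with rho << pi."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

def gibbs(pi: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Gibbs measure pi_{exp(h)}, with density exp(h) / pi[exp(h)] w.r.t. pi."""
    w = pi * np.exp(h - h.max())  # shifting h does not change the normalized measure
    return w / w.sum()

rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(6))    # prior on a six-point parameter space
h = rng.normal(size=6)            # a bounded function of the parameter
rho = rng.dirichlet(np.ones(6))   # an arbitrary posterior

log_laplace = float(np.log(np.sum(pi * np.exp(h))))
identity_rhs = float(rho @ h) - kl(rho, pi) + kl(rho, gibbs(pi, h))
best = gibbs(pi, h)
sup_value = float(best @ h) - kl(best, pi)
print(log_laplace, identity_rhs, sup_value)
```

The three printed numbers coincide up to rounding: the identity holds for any `rho`, and the supremum is reached at the Gibbs measure, where the extra divergence term vanishes.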



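Theorem 1.4. For any real constant $\lambda$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\bigl[\Phi_{\lambda/N}[\rho(R)] - \rho(r)\bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\}
\le \mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\bigl[\rho(\Phi_{\lambda/N} \circ R) - \rho(r)\bigr] - \mathcal{K}(\rho, \pi)\Bigr]\Bigr\}
\le 1.$$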



The following sections will show how to use this theorem. 



1.2. Non local bounds. At least three sorts of bounds can be deduced
from Theorem 1.4.

The most interesting ones to build estimators and tune parameters, as
well as the first that have been considered in the development of the
PAC-Bayesian approach, are deviation bounds. They provide an empirical upper
bound for $\rho(R)$ (that is, a bound which can be computed from observed
data) with some probability $1 - \epsilon$, where $\epsilon$ is a presumably small and
tunable confidence level.

However, since most of the results about the convergence speed of
estimators to be found in the statistical literature are concerned with the
expectation $\mathbb{P}[\rho(R)]$, it is also enlightening to bound this quantity. In
order to know at which rate it may be approaching $\inf_\Theta R$, a non random
upper bound is required, which will relate the average of the expected risk
$\mathbb{P}[\rho(R)]$ with the properties of the contrast function $\theta \mapsto R(\theta)$.

Since the values of constants do matter a lot when a bound is to be used
to select between various estimators using classification models of various
complexities, a third kind of bound, related to the first, may be considered
for the sake of its hopefully better constants: we will call them unbiased
empirical bounds, to stress the fact that they provide some empirical
quantity whose expectation under $\mathbb{P}$ can be proved to be an upper bound for
$\mathbb{P}[\rho(R)]$, the average expected risk. The price to pay for these better
constants is of course the lack of formal guarantee given by the bound: two
random variables whose expectations are ordered in a certain way may very
well be ordered in the reverse way with a large probability, so that basing
the estimation of parameters or the selection of an estimator on some
unbiased empirical bound is a hazardous business. Still, since it is common
practice to use the inequalities provided by mathematical statistical theory
while replacing the proven constants with smaller values showing a better
practical efficiency, considering unbiased empirical bounds akin to
deviation bounds provides an indication about how much the constants may be
decreased while not violating the theory too outrageously.

1.2.1. Unbiased empirical bounds. Let $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ be some fixed
(and arbitrary) posterior distribution, describing some randomized estimator
of $\theta$. As we already mentioned, in these notes a posterior distribution will
always be a regular conditional probability measure. By this we mean that

• for any $A \in \mathcal{T}$, the map
$\omega \mapsto \rho(\omega, A) : \bigl(\Omega, (\mathcal{B} \otimes \mathcal{B}')^{\otimes N}\bigr) \to \mathbb{R}_+$ is
assumed to be measurable;

• for any $\omega \in \Omega$, the map $A \mapsto \rho(\omega, A) : \mathcal{T} \to \mathbb{R}_+$ is assumed to be a
probability measure.

We will also assume without further notice that the $\sigma$-algebras we deal
with are always countably generated. The technical implications of these
assumptions are standard and discussed for instance in ^| pages 50-54] 
(where, among other things, a detailed proof of the decomposition of the 
Kullback-Leibler divergence is given).
Let us restrict to the case when the constant $\lambda$ is positive. We get from
Theorem 1.4 that

$$\exp\Bigl[\lambda\bigl\{\Phi_{\lambda/N}\bigl[\mathbb{P}[\rho(R)]\bigr] - \mathbb{P}[\rho(r)]\bigr\} - \mathbb{P}[\mathcal{K}(\rho, \pi)]\Bigr] \le 1, \qquad (1.1)$$

where we have used the convexity of the exp function and of
$\Phi_{\lambda/N}$. Since we have restricted our attention to positive values of the
constant $\lambda$, Equation (1.1) can also be written

$$\mathbb{P}[\rho(R)] \le \Phi_{\lambda/N}^{-1}\bigl\{\mathbb{P}\bigl[\rho(r) + \lambda^{-1}\mathcal{K}(\rho, \pi)\bigr]\bigr\},$$

leading to



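Theorem 1.5. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any
positive parameter $\lambda$,

$$\mathbb{P}[\rho(R)] \le \frac{1 - \exp\bigl\{-N^{-1}\,\mathbb{P}[\lambda\rho(r) + \mathcal{K}(\rho, \pi)]\bigr\}}{1 - \exp(-\frac{\lambda}{N})}
\le \mathbb{P}\biggl\{\frac{\lambda}{N[1 - \exp(-\frac{\lambda}{N})]}\Bigl[\rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda}\Bigr]\biggr\}.$$

The last inequality provides the unbiased empirical upper bound for $\rho(R)$
we were looking for, meaning that the expectation of
$\frac{\lambda}{N[1 - \exp(-\lambda/N)]}\bigl[\rho(r) + \frac{\mathcal{K}(\rho, \pi)}{\lambda}\bigr]$
is larger than the expectation of $\rho(R)$. Let us notice that
$1 \le \frac{\lambda}{N[1 - \exp(-\lambda/N)]} \le \bigl[1 - \frac{\lambda}{2N}\bigr]^{-1}$ and therefore
that this coefficient is close to 1 when $\lambda$ is significantly smaller than
$N$.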

If we are ready to believe in this bound (although this belief is not
mathematically well founded, as we already mentioned), we can use it to
optimize $\lambda$ and to choose $\rho$. While the optimal choice of $\rho$ when $\lambda$ is
fixed is to take it equal to $\pi_{\exp(-\lambda r)}$, a Gibbs posterior distribution,
as it is sometimes called, we may for computational reasons be more
interested in choosing $\rho$ in some other class of posterior distributions.
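On a finite parameter space, the Gibbs posterior and the empirical quantity $\frac{\lambda}{N[1 - \exp(-\lambda/N)]}[\rho(r) + \mathcal{K}(\rho, \pi)/\lambda]$ can be computed exactly. The sketch below is a toy illustration only: the five empirical error rates, the uniform prior and the function names are hypothetical.

```python
import numpy as np

def kl(rho: np.ndarray, pi: np.ndarray) -> float:
    """Kullback divergence K(rho, pi) for discrete distributions with rho << pi."""
    mask = rho > 0
    return float(np.sum(rho[mask] * np.log(rho[mask] / pi[mask])))

def gibbs_posterior(pi: np.ndarray, r: np.ndarray, lam: float) -> np.ndarray:
    """Gibbs posterior pi_{exp(-lam r)}: prior reweighted by exp(-lam * r)."""
    w = pi * np.exp(-lam * (r - r.min()))  # shift r for numerical stability
    return w / w.sum()

def empirical_bound(rho, pi, r, lam: float, n: int) -> float:
    """lam / (n (1 - exp(-lam/n))) * (rho(r) + K(rho, pi) / lam)."""
    coeff = lam / (n * -np.expm1(-lam / n))
    return float(coeff * (rho @ r + kl(rho, pi) / lam))

n, lam = 1000, 200.0
r = np.array([0.20, 0.22, 0.25, 0.30, 0.40])   # hypothetical empirical error rates
pi = np.full(5, 0.2)                            # uniform prior
dirac = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # point mass on the empirical minimizer

b_gibbs = empirical_bound(gibbs_posterior(pi, r, lam), pi, r, lam, n)
b_dirac = empirical_bound(dirac, pi, r, lam, n)
b_prior = empirical_bound(pi, pi, r, lam, n)
print(b_gibbs, b_dirac, b_prior)
```

For a fixed value of `lam`, the Gibbs posterior gives the smallest of the three values, as expected from the duality between the log Laplace transform and the Kullback divergence; the point mass pays the full price $-\log \pi(\{\theta\})$ and the prior pays through a larger $\rho(r)$.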

For instance, our real interest may be to select some deterministic
estimator from a family $\hat\theta_m : \Omega \to \Theta_m$, $m \in M$, of possible ones, where
$\Theta_m$ are measurable subsets of $\Theta$ and where $M$ is an arbitrary (not
necessarily countable) index set. We may for instance think of the case when
$\hat\theta_m \in \arg\min_{\Theta_m} r$. We may slightly randomize the estimators to start
with, considering for any $\theta \in \Theta_m$ and any $m \in M$,

$$\Delta_m(\theta) = \bigl\{\theta' \in \Theta_m : [f_{\theta'}(X_i)]_{i=1}^N = [f_\theta(X_i)]_{i=1}^N\bigr\},$$

and defining $\rho_m$ by the formula

$$\frac{d\rho_m}{d\pi}(\theta) = \frac{\mathbb{1}[\theta \in \Delta_m(\hat\theta_m)]}{\pi[\Delta_m(\hat\theta_m)]}.$$

Our posterior minimizes $\mathcal{K}(\rho, \pi)$ among those whose support is restricted
to the values of $\theta$ in $\Theta_m$ for which the classification rule $f_\theta$ is
identical to the estimated one $f_{\hat\theta_m}$ on the observed sample. Presumably,
in many practical situations, $f_\theta(x)$ will be $\rho_m$ almost surely identical
to $f_{\hat\theta_m}(x)$ when $\theta$ is drawn from $\rho_m$, for the vast majority of the
values of $x \in \mathcal{X}$ and all the submodels $\Theta_m$ not plagued with too much
overfitting (since this is by construction the case when
$x \in \{X_i : i = 1, \dots, N\}$). Therefore replacing $\hat\theta_m$ with $\rho_m$
can be expected to be a minor change in many situations. This change by
the way can be estimated in the (admittedly not so common) case when
the distribution of the patterns $(X_i)_{i=1}^N$ is known. Indeed, introducing
the pseudo distance

$$D(\theta, \theta') = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[f_\theta(X_i) \neq f_{\theta'}(X_i)\bigr],$$

one immediately sees that $R(\theta') \le R(\theta) + D(\theta, \theta')$, for any
$\theta, \theta' \in \Theta$, and therefore that

$$R(\hat\theta_m) \le \rho_m(R) + \rho_m\bigl[D(\cdot, \hat\theta_m)\bigr].$$

Let us notice also that in the case where $\Theta_m \subset \mathbb{R}^{d_m}$, and $R$ happens to
be convex on $\Delta_m(\hat\theta_m)$, then $\rho_m(R) \ge R\bigl[\int \theta \rho_m(d\theta)\bigr]$, and we can
replace $\hat\theta_m$ with $\tilde\theta_m = \int \theta \rho_m(d\theta)$, and obtain bounds for
$R(\tilde\theta_m)$. This is not a very heavy assumption about $R$, in the case where
we consider $\hat\theta_m \in \arg\min_{\Theta_m} r$. Indeed, $\hat\theta_m$, and therefore
$\Delta_m(\hat\theta_m)$, will be presumably close to $\arg\min_{\Theta_m} R$, and requiring a
function to be convex in the neighbourhood of its minima is not a very
strong assumption.

Since $r(\hat\theta_m) = \rho_m(r)$, and $\mathcal{K}(\rho_m, \pi) = -\log\{\pi[\Delta_m(\hat\theta_m)]\}$, our
unbiased empirical upper bound in this context reads as

$$\frac{\lambda}{N[1 - \exp(-\frac{\lambda}{N})]}\biggl\{r(\hat\theta_m) - \frac{\log\{\pi[\Delta_m(\hat\theta_m)]\}}{\lambda}\biggr\}.$$
JV[l-exp(-A)] 
Let us notice that we obtain a complexity factor
$-\log\{\pi[\Delta_m(\hat\theta_m)]\}$ which may be compared with the Vapnik-Cervonenkis
dimension. Indeed, in the case of binary classification, when using a
classification model with VC dimension not greater than $h_m$, that is when
any subset of $\mathcal{X}$ which can be split in any arbitrary way by some
classification rule $f_\theta$ of the model $\Theta_m$ has at most $h_m$ points, then

$$\bigl\{\Delta_m(\theta) : \theta \in \Theta_m\bigr\}$$

is a partition of $\Theta_m$ with at most $\bigl(\frac{eN}{h_m}\bigr)^{h_m}$ components. Therefore

$$\inf_{\theta \in \Theta_m} -\log\{\pi[\Delta_m(\theta)]\} \le h_m \log\Bigl(\frac{eN}{h_m}\Bigr) - \log[\pi(\Theta_m)].$$

Thus, if the model and prior distribution are well suited to the
classification task, in the sense that there is more "room" (where room is
measured with $\pi$) between the two clusters defined by $\hat\theta_m$ than between
other partitions of the sample of patterns $(X_i)_{i=1}^N$, then we will have

$$-\log\{\pi[\Delta_m(\hat\theta_m)]\} \le h_m \log\Bigl(\frac{eN}{h_m}\Bigr) - \log[\pi(\Theta_m)].$$

An optimal value $\hat m$ may be selected so that

$$\hat m \in \arg\min_{m \in M}\biggl\{\inf_{\lambda \in \mathbb{R}_+} \frac{\lambda}{N[1 - \exp(-\frac{\lambda}{N})]}\biggl(r(\hat\theta_m) - \frac{\log\{\pi[\Delta_m(\hat\theta_m)]\}}{\lambda}\biggr)\biggr\}.$$

Since $\rho_{\hat m}$ is still another posterior distribution, we can be sure that

$$\mathbb{P}\bigl[\rho_{\hat m}(R)\bigr] \le \inf_{\lambda \in \mathbb{R}_+} \mathbb{P}\biggl\{\frac{\lambda}{N[1 - \exp(-\frac{\lambda}{N})]}\biggl(r(\hat\theta_{\hat m}) - \frac{\log\{\pi[\Delta_{\hat m}(\hat\theta_{\hat m})]\}}{\lambda}\biggr)\biggr\}.$$

(Taking the infimum in $\lambda$ inside the expectation with respect to $\mathbb{P}$ would
be possible at the price of some supplementary technicalities and a slight
increase of the bound that we prefer to postpone to the discussion of
deviation bounds, since they are the only ones to provide a rigorous
mathematical foundation to the adaptive selection of estimators.)
1.2.2. Optimizing explicitly the exponential parameter $\lambda$. We would like
to deal in this section with some technical issue we think helpful to the
understanding of Theorem 1.5: namely to investigate how
the upper bound it provides could be optimized, or at least approximately
optimized, in $\lambda$. It turns out that this can be done quite explicitly.

So we will consider in this discussion the posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ to be fixed, and our aim will be to eliminate the
constant $\lambda$ from the bound by choosing its value in some nearly optimal way
as a function of $\mathbb{P}[\rho(r)]$, the average of the empirical risk, and of
$\mathbb{P}[\mathcal{K}(\rho, \pi)]$, which controls overfitting.

Let the bound be written as

$$\varphi(\lambda) = \bigl[1 - \exp(-\tfrac{\lambda}{N})\bigr]^{-1}\Bigl\{1 - \exp\bigl[-\tfrac{\lambda}{N}\mathbb{P}[\rho(r)] - N^{-1}\mathbb{P}[\mathcal{K}(\rho, \pi)]\bigr]\Bigr\}.$$

We see that

$$N \frac{\partial}{\partial\lambda}\log[\varphi(\lambda)] = \frac{\mathbb{P}[\rho(r)]}{\exp\bigl[\frac{\lambda}{N}\mathbb{P}[\rho(r)] + N^{-1}\mathbb{P}[\mathcal{K}(\rho, \pi)]\bigr] - 1} - \frac{1}{\exp(\frac{\lambda}{N}) - 1}.$$

Thus, the optimal value for $\lambda$ is such that

$$\bigl[\exp(\tfrac{\lambda}{N}) - 1\bigr]\,\mathbb{P}[\rho(r)] = \exp\bigl[\tfrac{\lambda}{N}\mathbb{P}[\rho(r)] + N^{-1}\mathbb{P}[\mathcal{K}(\rho, \pi)]\bigr] - 1.$$

Assuming that $1 \gg \frac{\lambda}{N}\mathbb{P}[\rho(r)] \gg \frac{\mathbb{P}[\mathcal{K}(\rho, \pi)]}{N}$, and keeping only
higher order terms, we are led to choose

$$\lambda = \sqrt{\frac{2N\,\mathbb{P}[\mathcal{K}(\rho, \pi)]}{\mathbb{P}[\rho(r)]\{1 - \mathbb{P}[\rho(r)]\}}},$$



obtaining 



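Theorem 1.6. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)] \le \frac{1 - \exp\biggl\{-\sqrt{\dfrac{2\,\mathbb{P}[\mathcal{K}(\rho, \pi)]\,\mathbb{P}[\rho(r)]}{N\{1 - \mathbb{P}[\rho(r)]\}}} - \dfrac{\mathbb{P}[\mathcal{K}(\rho, \pi)]}{N}\biggr\}}{1 - \exp\biggl\{-\sqrt{\dfrac{2\,\mathbb{P}[\mathcal{K}(\rho, \pi)]}{N\,\mathbb{P}[\rho(r)]\{1 - \mathbb{P}[\rho(r)]\}}}\biggr\}}.$$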



This result of course is not very useful in itself, since neither of the two
quantities $\mathbb{P}[\rho(r)]$ and $\mathbb{P}[\mathcal{K}(\rho, \pi)]$ is easy to evaluate. Still, it
gives a hint that replacing them boldly with $\rho(r)$ and $\mathcal{K}(\rho, \pi)$ could
produce something close to a legitimate empirical upper bound for $\rho(R)$.
We will see in the subsection about deviation bounds that this is indeed
essentially true.

Let us remark that in the second section of these notes, we will see another
way of bounding

$$\inf_{\lambda \in \mathbb{R}_+} \Phi_{\lambda/N}^{-1}\Bigl(q + \frac{d}{\lambda}\Bigr),$$

leading to



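Theorem 1.7. For any prior distribution $\pi \in \mathcal{M}_+^1(\Theta)$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)] \le \Bigl(1 + \frac{2\,\mathbb{P}[\mathcal{K}(\rho, \pi)]}{N}\Bigr)^{-1}\Biggl\{\mathbb{P}[\rho(r)] + \frac{\mathbb{P}[\mathcal{K}(\rho, \pi)]}{N}
+ \sqrt{\frac{2\,\mathbb{P}[\mathcal{K}(\rho, \pi)]\,\mathbb{P}[\rho(r)]\{1 - \mathbb{P}[\rho(r)]\}}{N} + \frac{\mathbb{P}[\mathcal{K}(\rho, \pi)]^2}{N^2}}\Biggr\},$$

as soon as

$$\mathbb{P}[\rho(r)] + \sqrt{\frac{\mathbb{P}[\mathcal{K}(\rho, \pi)]}{2N}} \le \frac{1}{2},$$

and $\mathbb{P}[\rho(R)] \le \mathbb{P}[\rho(r)] + \sqrt{\dfrac{\mathbb{P}[\mathcal{K}(\rho, \pi)]}{2N}}$ otherwise.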



This theorem highlights the influence of three terms on the average expected
risk:

• the average empirical risk, $\mathbb{P}[\rho(r)]$, which as a rule will decrease as the
size of the classification model increases, acts as a bias term, capturing
the ability of the model to account for the observed sample itself;

• a variance term $\mathbb{P}[\rho(r)]\{1 - \mathbb{P}[\rho(r)]\}$ is due to the random
fluctuations of $\rho(r)$;

• a complexity term $\mathbb{P}[\mathcal{K}(\rho, \pi)]$, which as a rule will increase with the
size of the classification model, eventually acts as a multiplier of the
variance term.



We observed numerically that the bound provided by Theorem 1.6 is
better than the more classical Vapnik-like bound of Theorem 1.7. For
instance, when $N = 1000$, $\mathbb{P}[\rho(r)] = 0.2$ and $\mathbb{P}[\mathcal{K}(\rho, \pi)] = 10$,
Theorem 1.6 gives a bound lower than 0.2604, whereas the more classical
Vapnik-like approximation of Theorem 1.7 gives a bound larger than 0.2622.
Numerical simulations tend to suggest the two bounds are always ordered in
the same way, although this could be a little tedious to prove mathematically.
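The comparison can be replayed numerically. The sketch below rests on our reading of the two statements, which should be treated as an assumption: Theorem 1.6 is taken as the bound of Theorem 1.5 evaluated at $\lambda = \sqrt{2N\,\mathbb{P}[\mathcal{K}]/(\mathbb{P}[\rho(r)]\{1 - \mathbb{P}[\rho(r)]\})}$, and Theorem 1.7 as $(1 + 2k/N)^{-1}\{r + k/N + \sqrt{2kr(1-r)/N + k^2/N^2}\}$ when $r + \sqrt{k/(2N)} \le 1/2$, with $r = \mathbb{P}[\rho(r)]$ and $k = \mathbb{P}[\mathcal{K}(\rho, \pi)]$. The function names are hypothetical.

```python
import math

def theorem_1_6(r: float, k: float, n: int) -> float:
    """Bound of Theorem 1.5 at lam = sqrt(2 n k / (r (1 - r))) (assumed form)."""
    num = -math.expm1(-(math.sqrt(2 * k * r / (n * (1 - r))) + k / n))
    den = -math.expm1(-math.sqrt(2 * k / (n * r * (1 - r))))
    return num / den

def theorem_1_7(r: float, k: float, n: int) -> float:
    """Vapnik-like bound (assumed form), with its fallback for large r."""
    if r + math.sqrt(k / (2 * n)) <= 0.5:
        root = math.sqrt(2 * k * r * (1 - r) / n + (k / n) ** 2)
        return (r + k / n + root) / (1 + 2 * k / n)
    return r + math.sqrt(k / (2 * n))

b16 = theorem_1_6(r=0.2, k=10.0, n=1000)
b17 = theorem_1_7(r=0.2, k=10.0, n=1000)
print(f"Theorem 1.6: {b16:.4f}   Theorem 1.7: {b17:.4f}")
```

Under these assumed forms, the two values fall on the sides of 0.2604 and 0.2622 indicated in the text, and scanning other values of `r`, `k` and `n` is a cheap way to probe the conjectured ordering of the two bounds.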
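1.2.3. Non random bounds. It is time now to come to less tentative results
and see how far the average expected error rate $\mathbb{P}[\rho(R)]$ is from its best
possible value $\inf_\Theta R$.
Let us notice first that

$$\lambda\rho(r) + \mathcal{K}(\rho, \pi) = \mathcal{K}(\rho, \pi_{\exp(-\lambda r)}) - \log\bigl\{\pi[\exp(-\lambda r)]\bigr\}.$$

Let us remark moreover that $r \mapsto \log\bigl\{\pi[\exp(-\lambda r)]\bigr\}$ is a convex
functional, a property which can be used in the following way:

$$\mathbb{P}\bigl\{\log\bigl[\pi[\exp(-\lambda r)]\bigr]\bigr\}
= \mathbb{P}\Bigl\{\sup_{\rho \in \mathcal{M}_+^1(\Theta)} -\lambda\rho(r) - \mathcal{K}(\rho, \pi)\Bigr\}
\ge \sup_{\rho \in \mathcal{M}_+^1(\Theta)} \mathbb{P}\bigl\{-\lambda\rho(r) - \mathcal{K}(\rho, \pi)\bigr\}
= \sup_{\rho \in \mathcal{M}_+^1(\Theta)} -\lambda\rho(R) - \mathcal{K}(\rho, \pi)$$
$$= \log\bigl\{\pi[\exp(-\lambda R)]\bigr\} = -\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta. \qquad (1.3)$$

These remarks applied to Theorem 1.5 lead to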

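Theorem 1.8. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any
positive parameter $\lambda$,

$$\mathbb{P}[\rho(R)] \le \frac{1 - \exp\Bigl\{-\frac{1}{N}\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta - \frac{1}{N}\mathbb{P}\bigl[\mathcal{K}(\rho, \pi_{\exp(-\lambda r)})\bigr]\Bigr\}}{1 - \exp(-\frac{\lambda}{N})}
\le \frac{1}{N[1 - \exp(-\frac{\lambda}{N})]}\Bigl\{\int_0^\lambda \pi_{\exp(-\beta R)}(R)\, d\beta + \mathbb{P}\bigl[\mathcal{K}(\rho, \pi_{\exp(-\lambda r)})\bigr]\Bigr\}.$$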

This theorem is particularly well fitted for the case of the Gibbs poste- 
rior distribution p = iTcxp(-Xr) i where the entropy factor cancels and where 
IP [7rexp(-Ar) (-?^)] is shown to be bound to get close to infe R when goes 
to oo, as soon as X/N goes to while A goes to +oo. 

We can elaborate on Theorem 1.8 and define a notion of dimension of $(\Theta, R)$, with margin $\eta \ge 0$, putting

$$d_\eta(\Theta,R) = \sup_{\beta \in \mathbb{R}_+} \beta\bigl[\pi_{\exp(-\beta R)}(R) - \operatorname{ess\,inf}_\pi R - \eta\bigr] \le -\log\bigl\{\pi\bigl[R \le \operatorname{ess\,inf}_\pi R + \eta\bigr]\bigr\}. \tag{1.4}$$

This last inequality can be established by the chain of inequalities:

$$\beta\pi_{\exp(-\beta R)}(R) \le \int_0^\beta \pi_{\exp(-\gamma R)}(R)\,d\gamma = -\log\{\pi[\exp(-\beta R)]\} \le \beta\bigl(\operatorname{ess\,inf}_\pi R + \eta\bigr) - \log\bigl[\pi\bigl(R \le \operatorname{ess\,inf}_\pi R + \eta\bigr)\bigr],$$

where we have used successively the fact that $\lambda \mapsto \pi_{\exp(-\lambda R)}(R)$ is decreasing (because it is the derivative of the concave function $\lambda \mapsto -\log\{\pi[\exp(-\lambda R)]\}$) and the fact that the exponential function takes positive values.

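In typical "parametric" situations $d_0(\Theta,R)$ will be finite, and in all circumstances $d_\eta(\Theta,R)$ will be finite for any $\eta > 0$ (this is a direct consequence of the definition of the essential infimum). Using this notion of dimension, we see that

$$\begin{aligned}
\int_0^\lambda \pi_{\exp(-\beta R)}(R)\,d\beta &\le \lambda\bigl(\operatorname{ess\,inf}_\pi R + \eta\bigr) + \int_0^\lambda \Bigl[\frac{d_\eta}{\beta} \wedge \bigl(1 - \operatorname{ess\,inf}_\pi R - \eta\bigr)\Bigr]d\beta \\
&= \lambda\bigl(\operatorname{ess\,inf}_\pi R + \eta\bigr) + d_\eta(\Theta,R)\log\Bigl[\frac{e\lambda}{d_\eta(\Theta,R)}\bigl(1 - \operatorname{ess\,inf}_\pi R - \eta\bigr)\Bigr].
\end{aligned}$$

This leads to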

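Corollary 1.9. With the above notations, for any margin $\eta \in \mathbb{R}_+$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}[\rho(R)] \le \inf_{\lambda \in \mathbb{R}_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl[\operatorname{ess\,inf}_\pi R + \eta + \frac{d_\eta}{\lambda}\log\Bigl(\frac{e\lambda}{d_\eta}\Bigr) + \frac{\mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\}}{\lambda}\Bigr].$$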



If one wants a posterior distribution with a small support, the theorem can also be applied to the case when $\rho$ is obtained by truncating $\pi_{\exp(-\lambda r)}$ to some level set to reduce its support: let $\Theta_p = \{\theta \in \Theta : r(\theta) \le p\}$, and let us define for any $q \in\, )0,1)$ the level $p_q = \inf\{p : \pi_{\exp(-\lambda r)}(\Theta_p) \ge q\}$, let us then define $\rho_q$ by its density

'^^cxp(— Ar) ^cxp(— Ar) 

then po = 7rexp(-Ar) for any q € (0, 1(, 

1 - exp {-^ /;7rexp(-^fl)(i?)d^ - 



^[pgiR)] < 



l-exp(-A) 



< 
1.2.4. Deviation bounds. They provide results holding under the distribution $\mathbb{P}$ of the sample with probability at least $1 - \epsilon$, for any given confidence level, set by the choice of $\epsilon \in\, )0,1($. Using them is the only way to be quite (i.e. with probability $1 - \epsilon$) sure to do the right thing, although this right thing may be overpessimistic, since deviation upper bounds are larger than corresponding non biased bounds.

Starting again from Theorem 1.4 and using Markov's inequality $\mathbb{P}[\exp(h) \ge 1] \le \mathbb{P}[\exp(h)]$, we obtain

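Theorem 1.10. For any positive parameter $\lambda$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\rho(R) &\le \Phi_{\frac{\lambda}{N}}^{-1}\Bigl\{\rho(r) + \frac{\mathcal{K}(\rho,\pi) - \log(\epsilon)}{\lambda}\Bigr\} = \frac{1 - \exp\Bigl\{-\frac{\lambda\rho(r)}{N} - \frac{\mathcal{K}(\rho,\pi) - \log(\epsilon)}{N}\Bigr\}}{1 - \exp(-\frac{\lambda}{N})} \\
&\le \frac{\lambda}{N\bigl[1 - \exp(-\frac{\lambda}{N})\bigr]}\Bigl[\rho(r) + \frac{\mathcal{K}(\rho,\pi) - \log(\epsilon)}{\lambda}\Bigr].
\end{aligned}$$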
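To make the deviation bound concrete, here is a minimal sketch that evaluates both of its forms, the one with $1 - \exp\{-\frac{\lambda\rho(r)}{N} - \frac{\mathcal{K}(\rho,\pi) - \log(\epsilon)}{N}\}$ in the numerator and its linear relaxation, for given values of the empirical risk $\rho(r)$ and of the divergence $\mathcal{K}(\rho,\pi)$. The function name and the numerical inputs are hypothetical and serve only as an illustration.

```python
import math


def deviation_bound(emp_risk, kl, n, lam, eps):
    """Return (nonlinear, linear) upper bounds on rho(R) for a fixed lambda."""
    d = kl - math.log(eps)  # complexity term K(rho, pi) - log(eps)
    scale = 1.0 - math.exp(-lam / n)
    nonlinear = (1.0 - math.exp(-(lam * emp_risk + d) / n)) / scale
    linear = lam / (n * scale) * (emp_risk + d / lam)
    return nonlinear, linear


# Illustrative inputs (not taken from the text).
nonlinear, linear = deviation_bound(emp_risk=0.2, kl=10.0, n=1000, lam=400.0, eps=0.05)
print(f"nonlinear: {nonlinear:.4f}, linear: {linear:.4f}")
```

The nonlinear form is never larger than the linear one, since $1 - \exp(-x) \le x$.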



We see that for a fixed value of the parameter $\lambda$, the upper bound is optimized when the posterior is chosen to be the Gibbs distribution $\rho = \pi_{\exp(-\lambda r)}$.

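Moreover we would like to be entitled to optimize the bound in $\lambda$. Gaining the required uniformity in $\lambda$ can be done in the following way. Let us notice first that values of $\lambda$ less than 1 are not interesting (because they provide a bound larger than one, at least as soon as $\epsilon \le \exp(-1)$). Let us consider some real parameter $\alpha > 1$, and the set $\Lambda = \{\alpha^k ; k \in \mathbb{N}\}$. Let us put on this set the probability measure $\nu(\alpha^k) = [(k+1)(k+2)]^{-1}$. Applying the previous theorem to $\lambda = \alpha^k$ at confidence level $1 - \frac{\epsilon}{(k+1)(k+2)}$ and using a union bound, we see that with probability at least $1 - \epsilon$, for any posterior distribution $\rho$,

$$\rho(R) \le \inf_{\lambda' \in \Lambda} \Phi_{\frac{\lambda'}{N}}^{-1}\Biggl\{\rho(r) + \frac{\mathcal{K}(\rho,\pi) - \log(\epsilon) + 2\log\Bigl[\frac{\log(\alpha^2\lambda')}{\log(\alpha)}\Bigr]}{\lambda'}\Biggr\}.$$

Now we can remark that for any $\lambda \in (1,+\infty($, there is $\lambda' \in \Lambda$ such that $\alpha^{-1}\lambda \le \lambda' \le \lambda$. Moreover, for any $q \in (0,1)$, $\beta \mapsto \Phi_\beta^{-1}(q)$ is increasing on $\mathbb{R}_+$. Thus with probability at least $1 - \epsilon$, for any posterior distribution $\rho$,

$$\begin{aligned}
\rho(R) &\le \inf_{\lambda \in (1,\infty(} \Phi_{\frac{\lambda}{N}}^{-1}\Biggl\{\rho(r) + \frac{\alpha}{\lambda}\Bigl[\mathcal{K}(\rho,\pi) - \log(\epsilon) + 2\log\Bigl(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Bigr)\Bigr]\Biggr\} \\
&= \inf_{\lambda \in (1,\infty(} \frac{1 - \exp\Bigl\{-\frac{\lambda}{N}\rho(r) - \frac{\alpha}{N}\Bigl[\mathcal{K}(\rho,\pi) - \log(\epsilon) + 2\log\Bigl(\frac{\log(\alpha^2\lambda)}{\log(\alpha)}\Bigr)\Bigr]\Bigr\}}{1 - \exp(-\frac{\lambda}{N})}.
\end{aligned}$$

Taking the approximately optimal value

$$\lambda = \sqrt{\frac{2N\alpha\bigl[\mathcal{K}(\rho,\pi) - \log(\epsilon)\bigr]}{\rho(r)[1 - \rho(r)]}},$$

we obtain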



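Theorem 1.11. With probability $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, putting $d(\rho,\epsilon) = \mathcal{K}(\rho,\pi) - \log(\epsilon)$,

$$\begin{aligned}
\rho(R) &\le \inf_{k \in \mathbb{N}} \frac{1 - \exp\Bigl\{-\frac{\alpha^k}{N}\rho(r) - \frac{1}{N}\bigl[d(\rho,\epsilon) + \log[(k+1)(k+2)]\bigr]\Bigr\}}{1 - \exp\bigl(-\frac{\alpha^k}{N}\bigr)} \\
&\le \frac{1 - \exp\Biggl\{-\sqrt{\frac{2\alpha\rho(r)d(\rho,\epsilon)}{N[1 - \rho(r)]}} - \frac{\alpha}{N}\Biggl[d(\rho,\epsilon) + 2\log\Biggl(\frac{\log\Bigl(\alpha^2\sqrt{\frac{2N\alpha d(\rho,\epsilon)}{\rho(r)[1 - \rho(r)]}}\Bigr)}{\log(\alpha)}\Biggr)\Biggr]\Biggr\}}{1 - \exp\Biggl[-\sqrt{\frac{2\alpha d(\rho,\epsilon)}{N\rho(r)[1 - \rho(r)]}}\Biggr]}.
\end{aligned}$$

Moreover with probability at least $1 - \epsilon$, for any posterior distribution $\rho$ such that $\rho(r) = 0$,

$$\rho(R) \le 1 - \exp\Bigl[-\frac{\mathcal{K}(\rho,\pi) - \log(\epsilon)}{N}\Bigr].$$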
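The uniform bound over the geometric grid $\Lambda = \{\alpha^k\}$ is easy to evaluate numerically. The following sketch computes the infimum over $k$ of $[1 - \exp\{-\frac{\alpha^k}{N}\rho(r) - \frac{1}{N}[d(\rho,\epsilon) + \log((k+1)(k+2))]\}] / [1 - \exp(-\frac{\alpha^k}{N})]$; truncating the grid at `k_max` and the default value of `alpha` are assumptions made for the illustration.

```python
import math


def grid_bound(emp_risk, kl, n, eps, alpha=1.1, k_max=200):
    """Infimum over lambda = alpha**k, k = 0..k_max, of the union-bound estimate."""
    d = kl - math.log(eps)  # d(rho, eps) = K(rho, pi) - log(eps)
    best = 1.0  # an error rate never exceeds one
    for k in range(k_max + 1):
        lam = alpha ** k
        penalty = d + math.log((k + 1) * (k + 2))  # price of the union bound
        value = (1.0 - math.exp(-(lam * emp_risk + penalty) / n)) / (
            1.0 - math.exp(-lam / n)
        )
        best = min(best, value)
    return best


# Illustrative inputs (not taken from the text).
bound = grid_bound(emp_risk=0.2, kl=10.0, n=1000, eps=0.05)
print(f"bound uniform in lambda: {bound:.4f}")
```

The extra term $\log[(k+1)(k+2)]$ grows only logarithmically in $k$, so the price paid for uniformity in $\lambda$ is small compared with $d(\rho,\epsilon)$ in this example.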



We can also elaborate on the results in another direction by introducing the empirical dimension

$$d_e = \sup_{\beta \in \mathbb{R}_+} \beta\bigl[\pi_{\exp(-\beta r)}(r) - \operatorname{ess\,inf}_\pi r\bigr] \le -\log\bigl[\pi\bigl(r = \operatorname{ess\,inf}_\pi r\bigr)\bigr]. \tag{1.5}$$

(There is no need to introduce a margin in this definition, since $r$ takes at most $N$ values, and therefore $\pi(r = \operatorname{ess\,inf}_\pi r)$ will be strictly positive.) This leads to



Corollary 1.12. For any positive real constant $\lambda$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\rho(R) \le \Phi_{\frac{\lambda}{N}}^{-1}\Bigl[\operatorname{ess\,inf}_\pi r + \frac{d_e}{\lambda}\log\Bigl(\frac{e\lambda}{d_e}\Bigr) + \frac{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}] - \log(\epsilon)}{\lambda}\Bigr].$$

We could then make the bound uniform in $\lambda$ and optimize this parameter in a way similar to what was done to obtain Theorem 1.11.

1.3. Local bounds. In this subsection, better bounds will be achieved 
through a better choice of the prior distribution. This better prior distribu- 
tion turns out to depend on the unknown sample distribution P, and some 
work is required to circumvent this and obtain empirical bounds. 

1.3.1. Choice of the prior. As mentioned in the introduction, if one is willing to minimize the bound in expectation provided by Theorem 1.5 (page 13), one is led to consider the optimal choice $\pi = \mathbb{P}(\rho)$. However, this is but an ideal choice, since $\mathbb{P}$ is in all conceivable situations unknown. Nevertheless it shows that it is possible through Theorem 1.5 to measure the complexity of the classification model with $\mathbb{P}\{\mathcal{K}[\rho,\mathbb{P}(\rho)]\}$, which is nothing but the mutual information between the random sample $(X_i,Y_i)_{i=1}^N$ and the estimated parameter $\hat\theta$, when the sample is drawn according to $\mathbb{P}$ and the estimated parameter knowing the sample is drawn according to $\rho$.

In practice, since we cannot choose $\pi = \mathbb{P}(\rho)$, we have to be content with a flat prior $\pi$, resulting in a bound measuring complexity according to $\mathbb{P}[\mathcal{K}(\rho,\pi)] = \mathbb{P}\{\mathcal{K}[\rho,\mathbb{P}(\rho)]\} + \mathcal{K}[\mathbb{P}(\rho),\pi]$, larger by the entropy factor $\mathcal{K}[\mathbb{P}(\rho),\pi]$ than the optimal one (we are still commenting on Theorem 1.5).

If we want to base the choice of $\pi$ on Theorem 1.8 (page 18), and if we choose $\rho = \pi_{\exp(-\lambda r)}$ to optimize this bound, we will be inclined to choose some $\pi$ such that

$$\frac{1}{\lambda}\int_0^\lambda \pi_{\exp(-\beta R)}(R)\,d\beta = -\frac{1}{\lambda}\log\{\pi[\exp(-\lambda R)]\}$$

is as close as possible to $\inf_{\theta \in \Theta} R(\theta)$ in all circumstances. To give a more specific example, in the case when the distribution of the design $(X_i)_{i=1}^N$ is known, one can introduce on the parameter space the metric $D$ already defined by equation (1.2) on page 14 (or some available upper bound for this distance). In view of the fact that $R(\theta) - R(\theta') \le D(\theta,\theta')$, for any $\theta, \theta' \in \Theta$, it can be meaningful, at least theoretically, to choose $\pi$ as

$$\pi = \sum_{k=1}^\infty \frac{1}{k(k+1)}\pi_k,$$

where $\pi_k$ is the uniform measure on some minimal (or close to minimal) $2^{-k}$-net $\mathcal{N}(\Theta,D,2^{-k})$ of the metric space $(\Theta,D)$. With this choice

$$-\frac{1}{\lambda}\log\{\pi[\exp(-\lambda R)]\} \le \inf_{\theta \in \Theta} R(\theta) + \inf_k \Bigl\{2^{-k} + \frac{\log\bigl(|\mathcal{N}(\Theta,D,2^{-k})|\bigr) + \log[k(k+1)]}{\lambda}\Bigr\}.$$

k \ A 

Another possibility, when we have to deal with real valued parameters, meaning that $\Theta \subset \mathbb{R}^d$, is to code each real component $\theta_i \in \mathbb{R}$ of $\theta = (\theta_i)_{i=1}^d$ to some precision and to use a prior $\mu$ which is atomic on dyadic numbers. More precisely let us parametrize the set of dyadic real numbers as

$$\mathcal{D} = \Bigl\{r\bigl[s,m,p,(b_j)_{j=1}^p\bigr] = s2^m\Bigl(1 + \sum_{j=1}^p b_j 2^{-j}\Bigr) : s \in \{-1,+1\}, m \in \mathbb{Z}, p \in \mathbb{N}, b_j \in \{0,1\}\Bigr\},$$

where, as can be seen, $s$ codes the sign, $m$ the order of magnitude, $p$ the precision and $(b_j)_{j=1}^p$ the binary representation of the dyadic number $r[s,m,p,(b_j)_{j=1}^p]$. We can for instance consider on $\mathcal{D}$ the probability distribution

$$\mu\bigl\{r\bigl[s,m,p,(b_j)_{j=1}^p\bigr]\bigr\} = \bigl[3(|m|+1)(|m|+2)(p+1)(p+2)2^p\bigr]^{-1}, \tag{1.6}$$

and define $\pi \in \mathcal{M}_+^1(\mathbb{R}^d)$ as $\pi = \mu^{\otimes d}$. This kind of "coding" prior distribution can be used also to define a prior on the integers (by renormalizing the restriction of $\mu$ to integers to get a probability distribution). Using $\mu$ is somehow equivalent to picking up a representative of each dyadic interval, and makes it possible to restrict to the case when the posterior $\rho$ is a Dirac mass without losing too much (when $\Theta = (0,1)$, this approach is somewhat equivalent to considering as prior distribution the Lebesgue measure and using as posterior distributions the uniform probability measures on dyadic intervals, with the advantage of obtaining non randomized estimators). When one uses in this way an atomic prior and Dirac masses as posterior distributions, the bounds proven so far can be obtained through a simpler union bound argument. This is so true that some of the detractors of the PAC-Bayesian approach (which, as a newcomer, has sometimes received a suspicious greeting among statisticians) have argued that it cannot bring anything that elementary union bound arguments could not essentially provide. We do not of course share this derogatory opinion, and while we think that allowing for non atomic priors and posteriors is worthwhile, we also would like to stress that the local and relative bounds to come could hardly be obtained with the help of union bounds alone.
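As a sanity check on the coding prior of equation (1.6), the sketch below sums the weights $[3(|m|+1)(|m|+2)(p+1)(p+2)2^p]^{-1}$ over all codes $(s, m, p, (b_j))$ with $|m| \le M$ and $p \le P$, using exact rational arithmetic. The helper names and the truncation levels are illustrative; the closed form used for comparison follows from the telescoping sums $\sum_k [(k+1)(k+2)]^{-1}$.

```python
from fractions import Fraction


def dyadic_value(s, m, bits):
    """Dyadic number s * 2**m * (1 + sum_j b_j 2**-j) coded by (s, m, bits)."""
    frac = sum(Fraction(b, 2 ** (j + 1)) for j, b in enumerate(bits))
    return s * Fraction(2) ** m * (1 + frac)


def mu(m, p):
    """Weight given by (1.6) to one code with magnitude m and precision p."""
    return Fraction(1, 3 * (abs(m) + 1) * (abs(m) + 2) * (p + 1) * (p + 2) * 2 ** p)


def partial_mass(m_max, p_max):
    """Total weight of all codes with |m| <= m_max and p <= p_max."""
    total = Fraction(0)
    for m in range(-m_max, m_max + 1):
        for p in range(p_max + 1):
            # 2 signs and 2**p binary strings share the same weight.
            total += 2 * 2 ** p * mu(m, p)
    return total


mass = partial_mass(50, 50)
print(float(mass))  # approaches 1 as the truncation levels grow
```

The partial sums increase to one, which confirms that the factor 3 in (1.6) is the right normalizing constant when codes (rather than the numbers they represent) are counted.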

Although the choice of a flat prior seems at first glance to be the only alternative when nothing is known about the sample distribution $\mathbb{P}$, the previous discussion shows that this type of choice is lacking proper localisation, and namely that we lose a factor $\mathcal{K}\{\mathbb{P}[\pi_{\exp(-\lambda r)}],\pi\}$, the divergence between the bound-optimal prior $\mathbb{P}[\pi_{\exp(-\lambda r)}]$, which is concentrated near the minima of $R$ in favourable situations, and the flat prior $\pi$. Fortunately, there are technical ways to get around this difficulty and to obtain more local empirical bounds.

1.3.2. Unbiased local empirical bounds. The idea is to start with some flat prior $\pi \in \mathcal{M}_+^1(\Theta)$, and the posterior distribution $\rho = \pi_{\exp(-\lambda r)}$ minimizing the bound of Theorem 1.5 (page 13), when $\pi$ is used as a prior. To improve the bound, we would like to use $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ instead of $\pi$, and we are going to make the guess that we could approximate it with $\pi_{\exp(-\beta R)}$ (we have replaced the parameter $\lambda$ with some distinct parameter $\beta$ to give some more freedom to our investigation, and also because, intuitively, $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ may be expected to be less concentrated than each of the $\pi_{\exp(-\lambda r)}$ it is mixing, which suggests that the best approximation of $\mathbb{P}[\pi_{\exp(-\lambda r)}]$ by some $\pi_{\exp(-\beta R)}$ may be obtained for some parameter $\beta < \lambda$). We are then led to look for some empirical upper bound of $\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]$. This is happily provided by the following computation

$$\begin{aligned}
\mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]\} &= \mathbb{P}[\mathcal{K}(\rho,\pi)] + \beta\mathbb{P}[\rho(R)] + \log\{\pi[\exp(-\beta R)]\} \\
&= \mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\} + \beta\mathbb{P}[\rho(R - r)] + \log\{\pi[\exp(-\beta R)]\} - \mathbb{P}\bigl\{\log\{\pi[\exp(-\beta r)]\}\bigr\}.
\end{aligned}$$

Using the convexity of $r \mapsto \log\{\pi[\exp(-\beta r)]\}$ as in equation (1.3) on page 18, we see that

$$0 \le \mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\beta R)}]\} \le \beta\mathbb{P}[\rho(R - r)] + \mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\}.$$

This inequality has an interest of its own, since it provides a lower bound for $\mathbb{P}[\rho(R)]$. Moreover we can plug it into Theorem 1.5 (page 13) applied to the prior distribution $\pi_{\exp(-\beta R)}$ and obtain for any posterior distribution $\rho$ and any positive parameter $\lambda$ that

$$\Phi_{\frac{\lambda}{N}}\bigl\{\mathbb{P}[\rho(R)]\bigr\} \le \mathbb{P}\Bigl\{\rho(r) + \frac{\beta}{\lambda}\rho(R - r) + \frac{1}{\lambda}\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\Bigr\}.$$
In view of this, it is convenient to introduce the function

$$\begin{aligned}
\widetilde\Phi_{a,b}(p) &= (1 - b)^{-1}\bigl[\Phi_a(p) - bp\bigr] \\
&= -(1 - b)^{-1}\bigl\{a^{-1}\log\{1 - p[1 - \exp(-a)]\} + bp\bigr\}, \qquad p \in (0,1),\ a \in\, )0,\infty(,\ b \in (0,1(.
\end{aligned}$$

This is a convex function of $p$, moreover

$$\widetilde\Phi_{a,b}'(0) = \bigl\{a^{-1}[1 - \exp(-a)] - b\bigr\}(1 - b)^{-1},$$

showing that it is an increasing one to one convex map of the unit interval onto itself as soon as $b \le a^{-1}[1 - \exp(-a)]$. Its convexity, combined with the value of its derivative at the origin, shows that

$$\widetilde\Phi_{a,b}(p) \ge \frac{a^{-1}[1 - \exp(-a)] - b}{1 - b}\,p.$$
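The stated properties of this auxiliary function can be checked numerically. The sketch below takes $\Phi_a(p) = -a^{-1}\log\{1 - p[1 - \exp(-a)]\}$, defines the modified function $(1-b)^{-1}[\Phi_a(p) - bp]$, and compares it on a grid with its tangent at the origin, whose slope is $\{a^{-1}[1 - \exp(-a)] - b\}(1 - b)^{-1}$. The function names and the test values of $a$ and $b$ are illustrative choices.

```python
import math


def phi(a, p):
    """Phi_a(p) = -log(1 - p * (1 - exp(-a))) / a, for a > 0 and p in [0, 1]."""
    return -math.log(1.0 - p * (1.0 - math.exp(-a))) / a


def phi_tilde(a, b, p):
    """Modified function (Phi_a(p) - b * p) / (1 - b), for b in [0, 1)."""
    return (phi(a, p) - b * p) / (1.0 - b)


def slope_at_zero(a, b):
    """Derivative of phi_tilde(a, b, .) at the origin."""
    return ((1.0 - math.exp(-a)) / a - b) / (1.0 - b)


a, b = 0.5, 0.25  # satisfies b <= (1 - exp(-a)) / a
grid = [i / 100.0 for i in range(101)]
gaps = [phi_tilde(a, b, p) - slope_at_zero(a, b) * p for p in grid]
print(min(gaps), phi_tilde(a, b, 1.0))
```

On this grid the function stays above its tangent at the origin and maps the endpoints 0 and 1 to themselves, in agreement with the convexity argument.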

Using these notations and remarks, we can state 

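Theorem 1.13. For any positive real constants $\beta$ and $\lambda$ such that $0 \le \beta < N[1 - \exp(-\frac{\lambda}{N})]$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\mathbb{P}\Bigl\{\rho(r) - \frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\beta}\Bigr\} \le \mathbb{P}[\rho(R)] &\le \widetilde\Phi_{\frac{\lambda}{N},\frac{\beta}{\lambda}}^{-1}\Bigl\{\mathbb{P}\Bigl[\rho(r) + \frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\lambda - \beta}\Bigr]\Bigr\} \\
&\le \frac{\lambda - \beta}{N[1 - \exp(-\frac{\lambda}{N})] - \beta}\,\mathbb{P}\Bigl[\rho(r) + \frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\lambda - \beta}\Bigr].
\end{aligned}$$

Thus (taking $\lambda = 2\beta$), for any $\beta$ such that $0 \le \beta < \frac{N}{2}$,

$$\mathbb{P}[\rho(R)] \le \frac{1}{1 - \frac{2\beta}{N}}\,\mathbb{P}\Bigl\{\rho(r) + \frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]}{\beta}\Bigr\}.$$

Note that the last inequality is obtained using the fact that $1 - \exp(-x) \ge x - \frac{x^2}{2}$, $x \in \mathbb{R}_+$.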

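Corollary 1.14. For any $\beta \in (0,N($,

$$\begin{aligned}
\mathbb{P}[\pi_{\exp(-\beta r)}(r)] \le \mathbb{P}[\pi_{\exp(-\beta r)}(R)] &\le \inf_{\lambda \in (-N\log(1 - \frac{\beta}{N}),\infty(} \frac{\lambda - \beta}{N[1 - \exp(-\frac{\lambda}{N})] - \beta}\,\mathbb{P}[\pi_{\exp(-\beta r)}(r)] \\
&\le \frac{1}{1 - \frac{2\beta}{N}}\,\mathbb{P}[\pi_{\exp(-\beta r)}(r)],
\end{aligned}$$

the last inequality holding only when $\beta < \frac{N}{2}$.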



It is interesting to compare the upper bound provided by this corollary with Theorem 1.5 on page 13 when the posterior is a Gibbs measure $\rho = \pi_{\exp(-\beta r)}$. We see that we have succeeded in getting rid of the entropy term $\mathcal{K}[\pi_{\exp(-\beta r)},\pi]$, but at the price of an increase of the multiplicative factor, which for small values of $\frac{\beta}{N}$ grows from $(1 - \frac{\beta}{2N})^{-1}$ (when we take $\lambda = \beta$ in Theorem 1.5) to $(1 - \frac{2\beta}{N})^{-1}$. Therefore non localized bounds have an interest of their own, and are superseded by localized bounds only in favourable circumstances (presumably when the sample is large enough when compared with the complexity of the classification model).

Corollary 1.14 shows that when $\frac{\beta}{N}$ is small, $\pi_{\exp(-\beta r)}(r)$ is a tight approximation of $\pi_{\exp(-\beta r)}(R)$ in the mean (since we have an upper bound and a lower bound which are close together).

Another corollary is obtained by optimizing the bound given by Theorem 1.13 in $\rho$, which is done by taking $\rho = \pi_{\exp(-\lambda r)}$.

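Corollary 1.15. For any positive real constants $\beta$ and $\lambda$ such that $0 \le \beta < N[1 - \exp(-\frac{\lambda}{N})]$,

$$\begin{aligned}
\mathbb{P}[\pi_{\exp(-\lambda r)}(R)] &\le \widetilde\Phi_{\frac{\lambda}{N},\frac{\beta}{\lambda}}^{-1}\Bigl\{\mathbb{P}\Bigl[\frac{1}{\lambda - \beta}\int_\beta^\lambda \pi_{\exp(-\gamma r)}(r)\,d\gamma\Bigr]\Bigr\} \\
&\le \frac{1}{N[1 - \exp(-\frac{\lambda}{N})] - \beta}\,\mathbb{P}\Bigl[\int_\beta^\lambda \pi_{\exp(-\gamma r)}(r)\,d\gamma\Bigr].
\end{aligned}$$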

Although this inequality gives by construction a better upper bound for $\inf_{\lambda \in \mathbb{R}_+} \mathbb{P}[\pi_{\exp(-\lambda r)}(R)]$ than Corollary 1.14, it is not easy to tell which one of the two inequalities is the best to bound $\mathbb{P}[\pi_{\exp(-\lambda r)}(R)]$ for a fixed (and possibly suboptimal) value of $\lambda$, because in this case, one factor is improved while the other is worsened.

Using the empirical dimension $d_e$ defined by equation (1.5) on page 21, we see that

$$\frac{1}{\lambda - \beta}\int_\beta^\lambda \pi_{\exp(-\gamma r)}(r)\,d\gamma \le \operatorname{ess\,inf}_\pi r + \frac{d_e}{\lambda - \beta}\log\Bigl(\frac{\lambda}{\beta}\Bigr).$$

Therefore, in the case when we keep the ratio $\frac{\lambda}{\beta}$ bounded, we get a better dependence on the empirical dimension $d_e$ than in Corollary 1.12 (page 21).

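1.3.3. Non random local bounds. Let us come now to the localization of the non random upper bound given by Theorem 1.8 on page 18. According to Theorem 1.5 (page 13) applied to the localized prior $\pi_{\exp(-\beta R)}$,

$$\begin{aligned}
\lambda\Phi_{\frac{\lambda}{N}}\bigl\{\mathbb{P}[\rho(R)]\bigr\} &\le \mathbb{P}\bigl\{\lambda\rho(r) + \mathcal{K}(\rho,\pi) + \beta\rho(R)\bigr\} + \log\{\pi[\exp(-\beta R)]\} \\
&= \mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}] - \log\{\pi[\exp(-\lambda r)]\} + \beta\rho(R)\bigr\} + \log\{\pi[\exp(-\beta R)]\} \\
&\le \mathbb{P}\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}] + \beta\rho(R)\bigr\} - \log\{\pi[\exp(-\lambda R)]\} + \log\{\pi[\exp(-\beta R)]\},
\end{aligned}$$

where we have used as previously inequality (1.3) (page 18). This proves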

Theorem 1.16. For any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, for any real parameters $\beta$ and $\lambda$ such that $0 \le \beta < N[1 - \exp(-\frac{\lambda}{N})]$,

$$\mathbb{P}[\rho(R)] \le \frac{1}{N[1 - \exp(-\frac{\lambda}{N})] - \beta}\Bigl\{\int_\beta^\lambda \pi_{\exp(-\gamma R)}(R)\,d\gamma + \mathbb{P}\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\}\Bigr\}.$$

Let us notice in particular that this theorem contains Theorem 1.8 (page 18), which corresponds to the case $\beta = 0$. As a corollary, we see also, taking $\rho = \pi_{\exp(-\lambda r)}$ and $\lambda = 2\beta$, and noticing that $\gamma \mapsto \pi_{\exp(-\gamma R)}(R)$ is decreasing, that

$$\mathbb{P}[\pi_{\exp(-\lambda r)}(R)] \le \inf_{\beta,\ \beta < N[1 - \exp(-\frac{\lambda}{N})]} \frac{\lambda - \beta}{N[1 - \exp(-\frac{\lambda}{N})] - \beta}\,\pi_{\exp(-\beta R)}(R) \le \frac{1}{1 - \frac{\lambda}{N}}\,\pi_{\exp(-\frac{\lambda}{2}R)}(R).$$

We can use this inequality in conjunction with the notion of dimension with margin $\eta$ introduced by equation (1.4) to see that the Gibbs posterior achieves for a proper choice of $\lambda$ and any margin parameter $\eta \ge 0$ (which can be chosen to be equal to zero in parametric situations)

$$\inf_\lambda \mathbb{P}[\pi_{\exp(-\lambda r)}(R)] \le \operatorname{ess\,inf}_\pi R + \eta + \frac{4d_\eta}{N} + 2\sqrt{\frac{2d_\eta\bigl(\operatorname{ess\,inf}_\pi R + \eta\bigr)}{N} + \frac{4d_\eta^2}{N^2}}. \tag{1.7}$$

Deviation bounds to come next will show that the optimal $\lambda$ can be estimated from empirical data.

Let us propose a little numerical example as an illustration: assuming that $d_0 = 10$, $N = 1000$ and $\operatorname{ess\,inf}_\pi R = 0.2$, we obtain from equation (1.7) that $\inf_\lambda \mathbb{P}[\pi_{\exp(-\lambda r)}(R)] \le 0.373$.
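This figure can be reproduced numerically. The sketch below assumes that the right-hand side of (1.7) is the minimum over $\beta \in\, )0, N/2($ of $(\operatorname{ess\,inf}_\pi R + \eta + d_\eta/\beta) / (1 - 2\beta/N)$, which is what the preceding inequality gives once $\pi_{\exp(-\beta R)}(R)$ is bounded by means of the definition (1.4) of $d_\eta$. It compares the resulting closed form $c + \frac{4d}{N} + 2\sqrt{\frac{2dc}{N} + \frac{4d^2}{N^2}}$, with $c = \operatorname{ess\,inf}_\pi R + \eta$, to a brute-force minimization; the function names and the grid are illustrative.

```python
import math


def gibbs_bound_closed_form(d, n, c):
    """Closed-form minimum of (c + d / beta) / (1 - 2 * beta / n) over beta."""
    return c + 4.0 * d / n + 2.0 * math.sqrt(2.0 * d * c / n + 4.0 * d * d / (n * n))


def gibbs_bound_grid(d, n, c, steps=100000):
    """Brute-force minimum of the same expression over a grid in (0, n / 2)."""
    best = float("inf")
    for i in range(1, steps):
        beta = 0.5 * n * i / steps
        best = min(best, (c + d / beta) / (1.0 - 2.0 * beta / n))
    return best


closed = gibbs_bound_closed_form(d=10.0, n=1000.0, c=0.2)
brute = gibbs_bound_grid(d=10.0, n=1000.0, c=0.2)
print(f"closed form: {closed:.4f}, grid search: {brute:.4f}")
```

Both computations agree and stay below the value 0.373 quoted above.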

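1.3.4. Local deviation bounds. When it comes to deviation bounds, we will for technical reasons choose a slightly more involved change of prior distribution and apply Theorem 1.10 to the prior $\pi_{\exp[-\beta\Phi_{-\frac{\beta}{N}} \circ R]}$. The advantage of tweaking $R$ with the nonlinear function $\Phi_{-\frac{\beta}{N}}$ will appear in the search for an empirical upper bound of the local entropy term. Theorem 1.4 (page 11), used with the above mentioned local prior, shows that

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\bigl[\rho\bigl(\Phi_{\frac{\lambda}{N}} \circ R\bigr) - \rho(r)\bigr] - \mathcal{K}\bigl[\rho,\pi_{\exp(-\beta\Phi_{-\frac{\beta}{N}} \circ R)}\bigr]\Bigr]\Bigr\} \le 1. \tag{1.8}$$

Moreover

$$\mathcal{K}\bigl[\rho,\pi_{\exp[-\beta\Phi_{-\frac{\beta}{N}} \circ R]}\bigr] = \mathcal{K}[\rho,\pi_{\exp(-\beta r)}] + \beta\rho\bigl[\Phi_{-\frac{\beta}{N}} \circ R - r\bigr] + \log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} - \log\{\pi[\exp(-\beta r)]\}, \tag{1.9}$$

which is an invitation to find an upper bound for $\log\{\pi[\exp(-\beta\Phi_{-\frac{\beta}{N}} \circ R)]\} - \log\{\pi[\exp(-\beta r)]\}$. Let us call for short $\bar\pi$ our localized prior distribution, thus defined as

$$\frac{d\bar\pi}{d\pi}(\theta) = \frac{\exp\bigl\{-\beta\Phi_{-\frac{\beta}{N}}[R(\theta)]\bigr\}}{\pi\bigl\{\exp\bigl[-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr]\bigr\}}.$$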

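Applying once again Theorem 1.4 (page 11), but this time to $-\beta$, we see that

$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} - \log\{\pi[\exp(-\beta r)]\}\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} + \inf_{\rho \in \mathcal{M}_+^1(\Theta)} \beta\rho(r) + \mathcal{K}(\rho,\pi)\Bigr]\Bigr\} \\
&\quad \le \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} + \beta\bar\pi(r) + \mathcal{K}(\bar\pi,\pi)\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\exp\Bigl[\beta\bigl[\bar\pi(r) - \bar\pi\bigl(\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr] - \mathcal{K}(\bar\pi,\bar\pi)\Bigr]\Bigr\} \le 1.
\end{aligned} \tag{1.10}$$

Combining equations (1.9) and (1.10) and using the concavity of $\Phi_{-\frac{\beta}{N}}$, we see that with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho$,

$$0 \le \mathcal{K}(\rho,\bar\pi) \le \mathcal{K}[\rho,\pi_{\exp(-\beta r)}] + \beta\bigl[\Phi_{-\frac{\beta}{N}}[\rho(R)] - \rho(r)\bigr] - \log(\epsilon).$$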

We have proved a lower deviation bound: 

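Theorem 1.17. For any positive real constant $\beta$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\rho(R) \ge \frac{\exp\Bigl\{\frac{\beta}{N}\Bigl[\rho(r) - \frac{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}] - \log(\epsilon)}{\beta}\Bigr]\Bigr\} - 1}{\exp(\frac{\beta}{N}) - 1}.$$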



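Let us now seek an upper bound. Using the Cauchy-Schwarz inequality to combine equations (1.8) and (1.10), we obtain

$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2}\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\rho\bigl(\Phi_{\frac{\lambda}{N}} \circ R\bigr) - \beta\rho\bigl(\Phi_{-\frac{\beta}{N}} \circ R\bigr) - (\lambda - \beta)\rho(r) - \mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2}\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \Bigl(\lambda\bigl[\rho\bigl(\Phi_{\frac{\lambda}{N}} \circ R\bigr) - \rho(r)\bigr] - \mathcal{K}(\rho,\bar\pi)\Bigr)\Bigr] \\
&\qquad\qquad \times \exp\Bigl[\frac{1}{2}\Bigl(\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} - \log\{\pi[\exp(-\beta r)]\}\Bigr)\Bigr]\Bigr\} \\
&\quad \le \mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \Bigl(\lambda\bigl[\rho\bigl(\Phi_{\frac{\lambda}{N}} \circ R\bigr) - \rho(r)\bigr] - \mathcal{K}(\rho,\bar\pi)\Bigr)\Bigr]\Bigr\}^{1/2} \\
&\qquad\qquad \times \mathbb{P}\Bigl\{\exp\Bigl[\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\beta}{N}} \circ R\bigr)\bigr]\bigr\} - \log\{\pi[\exp(-\beta r)]\}\Bigr]\Bigr\}^{1/2} \le 1.
\end{aligned} \tag{1.11}$$

Thus with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho$,

$$\lambda\Phi_{\frac{\lambda}{N}}[\rho(R)] - \beta\Phi_{-\frac{\beta}{N}}[\rho(R)] \le (\lambda - \beta)\rho(r) + \mathcal{K}(\rho,\pi_{\exp(-\beta r)}) - 2\log(\epsilon).$$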
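(It would have been more straightforward to use a union bound on deviation inequalities instead of the Cauchy-Schwarz inequality on exponential moments; however, this would have led to replacing $-2\log(\epsilon)$ with the worse factor $2\log(\frac{2}{\epsilon})$.) Let us now recall that

$$\lambda\Phi_{\frac{\lambda}{N}}(p) - \beta\Phi_{-\frac{\beta}{N}}(p) = -N\log\bigl\{1 - \bigl[1 - \exp(-\tfrac{\lambda}{N})\bigr]p\bigr\} - N\log\bigl\{1 + \bigl[\exp(\tfrac{\beta}{N}) - 1\bigr]p\bigr\},$$

and let us put

$$\begin{aligned}
B &= (\lambda - \beta)\rho(r) + \mathcal{K}[\rho,\pi_{\exp(-\beta r)}] - 2\log(\epsilon) \\
&= \mathcal{K}[\rho,\pi_{\exp(-\lambda r)}] + \int_\beta^\lambda \pi_{\exp(-\xi r)}(r)\,d\xi - 2\log(\epsilon).
\end{aligned}$$

Let us consider moreover the change of variables $\alpha = 1 - \exp(-\frac{\lambda}{N})$ and $\gamma = \exp(\frac{\beta}{N}) - 1$.

We obtain $[1 - \alpha\rho(R)][1 + \gamma\rho(R)] \ge \exp(-\frac{B}{N})$, leading to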
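Theorem 1.18. For any positive constants $\alpha$, $\gamma$, such that $0 \le \gamma < \alpha < 1$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, the bound

$$\begin{aligned}
M(\rho) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\rho(r) + \frac{\mathcal{K}\bigl(\rho,\pi_{\exp[-N\log(1+\gamma)r]}\bigr) - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{\mathcal{K}\bigl[\rho,\pi_{\exp[N\log(1-\alpha)r]}\bigr] + \int_{N\log(1+\gamma)}^{-N\log(1-\alpha)} \pi_{\exp(-\xi r)}(r)\,d\xi - 2\log(\epsilon)}{N(\alpha - \gamma)}
\end{aligned}$$

is such that

$$\rho(R) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Biggl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)M(\rho)]\bigr\}} - 1\Biggr) \le M(\rho).$$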

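Using the empirical dimension $d_e$ defined by equation (1.5) on page 21, we can use the inequality

$$\int_\beta^\lambda \pi_{\exp(-\xi r)}(r)\,d\xi \le (\lambda - \beta)\operatorname{ess\,inf}_\pi r + d_e\log\Bigl(\frac{\lambda}{\beta}\Bigr),$$

to prove that

$$M(\rho) \le \frac{\log[(1 + \gamma)(1 - \alpha)]}{\gamma - \alpha}\operatorname{ess\,inf}_\pi r + \frac{d_e\log\Bigl(\frac{-\log(1-\alpha)}{\log(1+\gamma)}\Bigr)}{N(\alpha - \gamma)} + \frac{\mathcal{K}\bigl[\rho,\pi_{\exp[N\log(1-\alpha)r]}\bigr] - 2\log(\epsilon)}{N(\alpha - \gamma)}.$$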

Let us give a little numerical illustration: assuming that $d_e = 10$, $N = 1000$ and $\operatorname{ess\,inf}_\pi r = 0.2$, taking $\epsilon = 0.01$, $\alpha = 0.5$ and $\gamma = 0.1$, we obtain from Theorem 1.18 that $\pi_{\exp[N\log(1-\alpha)r]}(R) \simeq \pi_{\exp(-693r)}(R) \le 0.332 \le 0.372$, where we have given respectively the non linear and the linear bound. This shows the practical interest of keeping the non-linearity. Let us also mention that optimizing the values of the parameters $\alpha$ and $\gamma$ would not have yielded a significantly lower bound.
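The two figures of this illustration can be recomputed as follows. The sketch assumes the value $\operatorname{ess\,inf}_\pi r = 0.2$ (the value used in the earlier example, and the one that reproduces both quoted numbers), takes $\rho = \pi_{\exp[N\log(1-\alpha)r]}$ so that the divergence term vanishes, and bounds $M(\rho)$ through the empirical dimension $d_e$ by $-\frac{\log[(1-\alpha)(1+\gamma)]}{\alpha-\gamma}\operatorname{ess\,inf}_\pi r + \frac{d_e}{N(\alpha-\gamma)}\log\frac{-\log(1-\alpha)}{\log(1+\gamma)} - \frac{2\log(\epsilon)}{N(\alpha-\gamma)}$. Function names are illustrative.

```python
import math


def linear_bound(r_star, d_e, n, eps, alpha, gamma, kl=0.0):
    """Upper bound on M(rho) expressed with the empirical dimension d_e."""
    gap = alpha - gamma
    slope = -math.log((1.0 - alpha) * (1.0 + gamma)) / gap
    dim_term = d_e * math.log(-math.log(1.0 - alpha) / math.log(1.0 + gamma)) / (n * gap)
    conf_term = (kl - 2.0 * math.log(eps)) / (n * gap)
    return slope * r_star + dim_term + conf_term


def nonlinear_bound(m_value, alpha, gamma):
    """Bound on rho(R) solving [1 - alpha x][1 + gamma x] >= exp(-(alpha - gamma) M)."""
    gap = alpha - gamma
    inner = 1.0 + 4.0 * alpha * gamma / gap ** 2 * (1.0 - math.exp(-gap * m_value))
    return gap / (2.0 * alpha * gamma) * (math.sqrt(inner) - 1.0)


m_value = linear_bound(r_star=0.2, d_e=10.0, n=1000, eps=0.01, alpha=0.5, gamma=0.1)
sharp = nonlinear_bound(m_value, alpha=0.5, gamma=0.1)
print(f"linear: {m_value:.4f}, non linear: {sharp:.4f}")
```

Under these assumptions the linear bound is just below 0.372 and the non linear one just below 0.332, matching the values given in the text.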

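The following corollary is obtained by taking $\lambda = 2\beta$ and keeping only the linear bound; we give it for the sake of its simplicity:

Corollary 1.19. For any positive real constant $\beta$ such that $\exp(\frac{\beta}{N}) + \exp(-\frac{2\beta}{N}) < 2$, which is the case when $\beta < 0.48N$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\rho(R) &\le \frac{\beta\rho(r) + \mathcal{K}[\rho,\pi_{\exp(-\beta r)}] - 2\log(\epsilon)}{N\bigl[2 - \exp(\frac{\beta}{N}) - \exp(-\frac{2\beta}{N})\bigr]} \\
&= \frac{\int_\beta^{2\beta} \pi_{\exp(-\xi r)}(r)\,d\xi + \mathcal{K}[\rho,\pi_{\exp(-2\beta r)}] - 2\log(\epsilon)}{N\bigl[2 - \exp(\frac{\beta}{N}) - \exp(-\frac{2\beta}{N})\bigr]}.
\end{aligned}$$

Let us mention that this corollary applied to the above numerical example gives $\pi_{\exp(-200r)}(R) \le 0.475$ (when we take $\beta = 100$, consistently with the choice $\gamma = 0.1$).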
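The value 0.475 can be checked in the same way. The sketch below evaluates the linear bound obtained for $\lambda = 2\beta$, namely $[\int_\beta^{2\beta}\pi_{\exp(-\xi r)}(r)\,d\xi + \mathcal{K} - 2\log(\epsilon)] / (N[2 - \exp(\frac{\beta}{N}) - \exp(-\frac{2\beta}{N})])$, with the integral bounded by $\beta\operatorname{ess\,inf}_\pi r + d_e\log(2)$. It assumes $\operatorname{ess\,inf}_\pi r = 0.2$, $d_e = 10$, $N = 1000$, $\epsilon = 0.01$ and a vanishing divergence term (Gibbs posterior at inverse temperature $2\beta$); the function name is illustrative.

```python
import math


def linear_local_bound(r_star, d_e, n, eps, beta, kl=0.0):
    """Linear local deviation bound for lambda = 2 * beta."""
    denom = n * (2.0 - math.exp(beta / n) - math.exp(-2.0 * beta / n))
    if denom <= 0.0:
        raise ValueError("beta is too large: exp(beta/n) + exp(-2 beta/n) must be < 2")
    # Integral of the Gibbs empirical risk between beta and 2 * beta,
    # bounded through the empirical dimension d_e.
    integral = beta * r_star + d_e * math.log(2.0)
    return (integral + kl - 2.0 * math.log(eps)) / denom


value = linear_local_bound(r_star=0.2, d_e=10.0, n=1000, eps=0.01, beta=100.0)
print(f"bound: {value:.4f}")
```

The result is slightly below 0.475, noticeably worse than the 0.332 given by the non linear bound in the same setting.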

1.3.5. Partially local bounds. Local bounds are suitable when the lowest values of the empirical error rate $r$ are reached only on a small part of the parameter set $\Theta$. When $\Theta$ is the disjoint union of submodels of different complexities, the minimum of $r$ will as a rule not be "localized" in a way that calls for the use of local bounds. Just think for instance of the case when $\Theta = \bigcup_{m=1}^M \Theta_m$, where the sets $\Theta_1 \subset \Theta_2 \subset \cdots \subset \Theta_M$ are nested. In this case we will have $\inf_{\Theta_1} r \ge \inf_{\Theta_2} r \ge \cdots \ge \inf_{\Theta_M} r$, although $\Theta_M$ may be too large to be the right model to use. In this situation, we do not want to localize the bound completely. Let us make a more specific fanciful but typical pseudo computation. Just imagine we have a countable collection $(\Theta_m)_{m \in M}$ of submodels. Let us assume we are interested in choosing between the estimators $\hat\theta_m \in \arg\min_{\Theta_m} r$, maybe randomizing them (e.g. replacing them with $\pi^m_{\exp(-\lambda r)}$). Let us imagine moreover that we are in a typically parametric situation, where, for some priors $\pi^m \in \mathcal{M}_+^1(\Theta_m)$, $m \in M$, there is a "dimension" $d_m$ such that $\lambda\bigl[\pi^m_{\exp(-\lambda r)}(r) - r(\hat\theta_m)\bigr] \simeq d_m$. Let $\mu \in \mathcal{M}_+^1(M)$ be some distribution on the index set $M$. It is easy to see that $(\mu\pi)_{\exp(-\lambda r)}$ will typically not be properly local, in the sense that typically
will typically not be properly local, in the sense that typically 



(M71")exp(-Ar)('' 



/J {tTcxix-a, ) ('')7r [oxp(-Ar)] | 



/x|7r[exp(-Ar)]| 
E ^) + ^] exp[-A(inf r) - log(^)] //(m) 



meM 



J2 exp -A(infr) -d^log(g-) /x(m) 

neM 

~ i inf (inf r) + %log( , > 



+ log| exp[-d„log(^)]/x(m) 



where we have used the estimate

$$\begin{aligned}
-\log\{\pi[\exp(-\lambda r)]\} &= \int_0^\lambda \pi_{\exp(-\beta r)}(r)\,d\beta \\
&\simeq \int_0^\lambda \Bigl[\bigl(\inf_{\Theta_m} r\bigr) + \frac{d_m}{\beta} \wedge 1\Bigr]d\beta \simeq \lambda\bigl(\inf_{\Theta_m} r\bigr) + d_m\Bigl[\log\Bigl(\frac{\lambda}{d_m}\Bigr) + 1\Bigr].
\end{aligned}$$

Our approximations make no pretence of being rigorous or very accurate, but they nevertheless give the best order of magnitude we can expect in typical situations, and show that this order of magnitude is not what we are looking for: mixing different models with the help of $\mu$ spoils the localization, introducing a multiplier $\log(\frac{e\lambda}{d_m})$ to the dimension $d_m$, which is precisely what we would have got if we had not localized the bound at all. What we would really like to do in such situations is to use a partially localized posterior distribution, such as $\pi^{\hat m}_{\exp(-\lambda r)}$, where $\hat m$ is an estimator of the best submodel to be used. While the most straightforward way to do this is to use a union bound on results obtained for each submodel $\Theta_m$, we are going here to show how to allow arbitrary posterior distributions on the index set (corresponding to a randomization of the choice of $\hat m$).

Let us consider the framework we just mentioned: let the measurable parameter set $(\Theta,\mathcal{T})$ be a disjoint union of measurable submodels, $\Theta = \bigsqcup_{m \in M} \Theta_m$. Let the index set $(M,\mathcal{M})$ be some measurable space (most of the time it will be a countable set). Let $\mu \in \mathcal{M}_+^1(M)$ be a prior probability distribution on $(M,\mathcal{M})$. Let $\pi : M \to \mathcal{M}_+^1(\Theta)$ be a regular conditional probability measure such that $\pi(m,\Theta_m) = 1$ for any $m \in M$. Let $\mu\pi \in \mathcal{M}_+^1(M \times \Theta)$ be the product probability measure defined by $\mu\pi(h) = \int_{m \in M}\bigl(\int_{\theta \in \Theta} h(m,\theta)\pi(m,d\theta)\bigr)\mu(dm)$, for any bounded measurable function $h : M \times \Theta \to \mathbb{R}$. Let $\pi_{\exp(h)}$ be the regular conditional probability measure defined by

$$\frac{d\pi_{\exp(h)}}{d\pi}(m,\theta) = \frac{\exp[h(\theta)]}{\pi[m,\exp(h)]},$$

where consistently with previous notations $\pi(m,h) = \int_\Theta h(m,\theta)\pi(m,d\theta)$ (we will also often use the less explicit notation $\pi(h)$). Let for short

$$U(\theta,\omega) = \lambda\Phi_{\frac{\lambda}{N}}[R(\theta)] - \beta\Phi_{-\frac{\beta}{N}}[R(\theta)] - (\lambda - \beta)r(\theta,\omega).$$

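Integrating with respect to $\mu$ equation (1.11) on page 29, written in each submodel $\Theta_m$ using the prior distribution $\pi(m,\cdot)$, we see that

$$\begin{aligned}
&\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\nu \in \mathcal{M}_+^1(M)}\ \sup_{\rho : M \to \mathcal{M}_+^1(\Theta)} \frac{1}{2}\Bigl[(\nu\rho)(U) - \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\}\Bigr] - \mathcal{K}(\nu,\mu)\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\exp\Bigl[\sup_{\nu \in \mathcal{M}_+^1(M)} \frac{1}{2}\nu\Bigl(\sup_{\rho : M \to \mathcal{M}_+^1(\Theta)} \rho(U) - \mathcal{K}(\rho,\pi_{\exp(-\beta r)})\Bigr) - \mathcal{K}(\nu,\mu)\Bigr]\Bigr\} \\
&\quad = \mathbb{P}\Bigl\{\mu\Bigl[\exp\Bigl\{\frac{1}{2}\sup_\rho\bigl[\rho(U) - \mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr]\Bigr\}\Bigr]\Bigr\} \\
&\quad = \mu\Bigl\{\mathbb{P}\Bigl[\exp\Bigl\{\frac{1}{2}\sup_\rho\bigl[\rho(U) - \mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr]\Bigr\}\Bigr]\Bigr\} \le 1.
\end{aligned}$$

This proves that

$$\mathbb{P}\Bigl\{\exp\Bigl[\frac{1}{2}\sup_{\nu \in \mathcal{M}_+^1(M)}\ \sup_{\rho : M \to \mathcal{M}_+^1(\Theta)} \lambda\Phi_{\frac{\lambda}{N}}[\nu\rho(R)] - \beta\Phi_{-\frac{\beta}{N}}[\nu\rho(R)] - (\lambda - \beta)\nu\rho(r) - 2\mathcal{K}(\nu,\mu) - \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\}\Bigr]\Bigr\} \le 1. \tag{1.12}$$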
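Introducing the optimal value of $r$ on each submodel $r^\star(m) = \operatorname{ess\,inf}_{\pi(m,\cdot)} r$ and the empirical dimensions

$$d_e(m) = \sup_{\xi \in \mathbb{R}_+} \xi\bigl[\pi_{\exp(-\xi r)}(m,r) - r^\star(m)\bigr],$$

we can thus state

Theorem 1.20. For any positive real constants $\beta < \lambda$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$, for any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\lambda\Phi_{\frac{\lambda}{N}}[\nu\rho(R)] - \beta\Phi_{-\frac{\beta}{N}}[\nu\rho(R)] \le B_1(\nu,\rho),$$

$$\begin{aligned}
\text{where } B_1(\nu,\rho) &= (\lambda - \beta)\nu\rho(r) + 2\mathcal{K}(\nu,\mu) + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} - 2\log(\epsilon) \\
&= \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr] + 2\mathcal{K}(\nu,\mu) + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon) \\
&= -2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr)\Bigr]\Bigr\} + 2\mathcal{K}\Bigl[\nu,\mu_{\bigl(\frac{\pi[\exp(-\lambda r)]}{\pi[\exp(-\beta r)]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon),
\end{aligned}$$

and therefore

$$B_1(\nu,\rho) \le \nu\Bigl[(\lambda - \beta)r^\star + \log\Bigl(\frac{\lambda}{\beta}\Bigr)d_e\Bigr] + 2\mathcal{K}(\nu,\mu) + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon),$$

as well as

$$B_1(\nu,\rho) \le -2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{\lambda - \beta}{2}r^\star - \frac{1}{2}\log\Bigl(\frac{\lambda}{\beta}\Bigr)d_e\Bigr)\Bigr]\Bigr\} + 2\mathcal{K}\Bigl[\nu,\mu_{\bigl(\frac{\pi[\exp(-\lambda r)]}{\pi[\exp(-\beta r)]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\} - 2\log(\epsilon).$$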
7r[exp{-/3T-)J 

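Thus, for any real constants $\alpha$ and $\gamma$ such that $0 \le \gamma < \alpha < 1$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$, the bound

$$\begin{aligned}
B_2(\nu,\rho) &= -\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}\nu\rho(r) + \frac{2\mathcal{K}(\nu,\mu) + \nu\bigl\{\mathcal{K}[\rho,\pi_{(1+\gamma)^{-Nr}}]\bigr\} - 2\log(\epsilon)}{N(\alpha - \gamma)} \\
&= \frac{1}{N(\alpha - \gamma)}\Bigl\{2\mathcal{K}\Bigl[\nu,\mu_{\bigl(\frac{\pi[(1-\alpha)^{Nr}]}{\pi[(1+\gamma)^{-Nr}]}\bigr)^{1/2}}\Bigr] + \nu\bigl\{\mathcal{K}[\rho,\pi_{(1-\alpha)^{Nr}}]\bigr\}\Bigr\} \\
&\qquad - \frac{2\log\Bigl\{\mu\Bigl[\exp\Bigl(-\frac{1}{2}\int_{N\log(1+\gamma)}^{-N\log(1-\alpha)} \pi_{\exp(-\xi r)}(\cdot,r)\,d\xi\Bigr)\Bigr]\Bigr\} + 2\log(\epsilon)}{N(\alpha - \gamma)}
\end{aligned}$$

satisfies

$$\nu\rho(R) \le \frac{\alpha - \gamma}{2\alpha\gamma}\Biggl(\sqrt{1 + \frac{4\alpha\gamma}{(\alpha - \gamma)^2}\bigl\{1 - \exp[-(\alpha - \gamma)B_2(\nu,\rho)]\bigr\}} - 1\Biggr) \le B_2(\nu,\rho).$$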



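Let us remark that in the case when $\nu = \mu_{\bigl(\frac{\pi[(1-\alpha)^{Nr}]}{\pi[(1+\gamma)^{-Nr}]}\bigr)^{1/2}}$ and $\rho = \pi_{(1-\alpha)^{Nr}}$, we get as desired a bound that is adaptively local in all the $\Theta_m$ (at least when $M$ is countable and $\mu$ is atomic):

$$\begin{aligned}
B_2(\nu,\rho) &\le -\frac{2}{N(\alpha - \gamma)}\log\Bigl\{\mu\Bigl[\exp\Bigl\{\frac{N}{2}\log[(1 + \gamma)(1 - \alpha)]\,r^\star - \frac{1}{2}\log\Bigl(\frac{-\log(1-\alpha)}{\log(1+\gamma)}\Bigr)d_e\Bigr\}\Bigr]\Bigr\} - \frac{2\log(\epsilon)}{N(\alpha - \gamma)} \\
&\le \inf_{m \in M}\Bigl\{-\frac{\log[(1 - \alpha)(1 + \gamma)]}{\alpha - \gamma}r^\star(m) + \log\Bigl(\frac{-\log(1-\alpha)}{\log(1+\gamma)}\Bigr)\frac{d_e(m)}{N(\alpha - \gamma)} - \frac{2\log[\epsilon\mu(m)]}{N(\alpha - \gamma)}\Bigr\}.
\end{aligned}$$

The penalization by the empirical dimension $d_e(m)$ in each submodel is as desired linear in $d_e(m)$. Non random partially local bounds could be obtained in a way that is easy to imagine. We leave this investigation to the reader.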

1.3.6. Two step localization. We have seen that the bound optimal choice of the posterior distribution $\nu$ on the index set in Theorem 1.20 (page 34) is such that

$$\frac{d\nu}{d\mu}(m) \sim \Bigl(\frac{\pi[\exp(-\lambda r(m,\cdot))]}{\pi[\exp(-\beta r(m,\cdot))]}\Bigr)^{1/2} = \exp\Bigl[-\frac{1}{2}\int_\beta^\lambda \pi_{\exp(-\alpha r)}(m,r)\,d\alpha\Bigr].$$

This suggests replacing the prior distribution $\mu$ with $\bar\mu$ defined by its density

$$\frac{d\bar\mu}{d\mu}(m) = \frac{\exp[-h(m)]}{\mu[\exp(-h)]},$$

where h{m) = 7rexp(_Q#_ „ oR) [^-^ oR{m, •)] da. (1.13) 

The use of $\Phi_{-\frac{\eta}{N}} \circ R$ instead of $R$ is motivated by technical reasons which will appear in subsequent computations. Indeed, we will need to bound

r /-A 



/ 7rexp(-a<I>_ ^ oR)(^-iLO^)da 
J/3 TT ^ _ 
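in order to handle $\mathcal{K}(\nu,\bar\mu)$. In the spirit of equation (1.8), starting back from Theorem 1.4 applied in each submodel $\Theta_m$ to the prior distribution $\pi_{\exp(-\gamma\Phi_{-\frac{\eta}{N}} \circ R)}$ and integrated with respect to $\bar\mu$, we see that for any positive real constants $\lambda$, $\gamma$ and $\eta$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ on the index set and any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\nu\rho\bigl(\lambda\Phi_{\frac{\lambda}{N}} \circ R - \gamma\Phi_{-\frac{\eta}{N}} \circ R\bigr) \le \lambda\nu\rho(r) + \nu\mathcal{K}(\rho,\pi) + \mathcal{K}(\nu,\bar\mu) + \nu\Bigl\{\log\bigl\{\pi\bigl[\exp\bigl(-\gamma\Phi_{-\frac{\eta}{N}} \circ R\bigr)\bigr]\bigr\}\Bigr\} - \log(\epsilon). \tag{1.14}$$

Since $x \mapsto f(x) := \lambda\Phi_{\frac{\lambda}{N}}(x) - \gamma\Phi_{-\frac{\eta}{N}}(x)$ is a convex function, it is such that

$$f(x) \ge xf'(0) = xN\Bigl\{\bigl[1 - \exp(-\tfrac{\lambda}{N})\bigr] - \frac{\gamma}{\eta}\bigl[\exp(\tfrac{\eta}{N}) - 1\bigr]\Bigr\}.$$

Thus if we put

$$\gamma = \frac{\eta\bigl[1 - \exp(-\frac{\lambda}{N})\bigr]}{\exp(\frac{\eta}{N}) - 1}, \tag{1.15}$$

we obtain that $f(x) \ge 0$, $x \in \mathbb{R}$, and therefore that the left-hand side of equation (1.14) is non negative. We can moreover introduce the prior conditional distribution $\bar\pi$ defined by

$$\frac{d\bar\pi}{d\pi}(m,\theta) = \frac{\exp\bigl[-\beta\Phi_{-\frac{\eta}{N}} \circ R(\theta)\bigr]}{\pi\bigl\{m,\exp\bigl[-\beta\Phi_{-\frac{\eta}{N}} \circ R\bigr]\bigr\}}.$$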

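With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\begin{aligned}
\beta\nu\rho(r) + \nu[\mathcal{K}(\rho,\pi)] &= \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} - \nu\bigl\{\log\{\pi[\exp(-\beta r)]\}\bigr\} \\
&\le \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} + \beta\nu\bar\pi(r) + \nu[\mathcal{K}(\bar\pi,\pi)] \\
&\le \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} - \nu\Bigl\{\log\bigl\{\pi\bigl[\exp\bigl(-\beta\Phi_{-\frac{\eta}{N}} \circ R\bigr)\bigr]\bigr\}\Bigr\} + \frac{\beta}{\eta}\bigl[\mathcal{K}(\nu,\bar\mu) - \log(\epsilon)\bigr].
\end{aligned}$$

Thus, coming back to equation (1.14), we see that under condition (1.15), with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$0 \le (\lambda - \beta)\nu\rho(r) + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} - \nu\Bigl[\int_\beta^\gamma \pi_{\exp(-\alpha\Phi_{-\frac{\eta}{N}} \circ R)}\bigl(\Phi_{-\frac{\eta}{N}} \circ R\bigr)\,d\alpha\Bigr] + \Bigl(1 + \frac{\beta}{\eta}\Bigr)\Bigl[\mathcal{K}(\nu,\bar\mu) + \log\Bigl(\frac{2}{\epsilon}\Bigr)\Bigr].$$

Noticing moreover that

$$(\lambda - \beta)\nu\rho(r) + \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\beta r)}]\bigr\} = \nu\bigl\{\mathcal{K}[\rho,\pi_{\exp(-\lambda r)}]\bigr\} + \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr]$$

and choosing $\rho = \pi_{\exp(-\lambda r)}$, we have proved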



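Theorem 1.21. For any positive real constants $\beta$, $\gamma$ and $\eta$, such that $\gamma < \eta[\exp(\frac{\eta}{N}) - 1]^{-1}$, defining $\lambda$ by condition (1.15), so that $\lambda = -N\log\bigl\{1 - \frac{\gamma}{\eta}[\exp(\frac{\eta}{N}) - 1]\bigr\}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$, any conditional posterior distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$\nu\Bigl[\int_\beta^\gamma \pi_{\exp(-\alpha\Phi_{-\frac{\eta}{N}} \circ R)}\bigl(\Phi_{-\frac{\eta}{N}} \circ R\bigr)\,d\alpha\Bigr] \le \nu\Bigl[\int_\beta^\lambda \pi_{\exp(-\alpha r)}(r)\,d\alpha\Bigr] + \Bigl(1 + \frac{\beta}{\eta}\Bigr)\Bigl[\mathcal{K}(\nu,\bar\mu) + \log\Bigl(\frac{2}{\epsilon}\Bigr)\Bigr].$$

Let us remark that this theorem does not require that $\beta < \gamma$, and thus provides both an upper and a lower bound for the quantity of interest:

Corollary 1.22. For any positive real constants $\beta$, $\gamma$ and $\eta$ such that $\max\{\beta,\gamma\} < \eta[\exp(\frac{\eta}{N}) - 1]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions $\nu : \Omega \to \mathcal{M}_+^1(M)$ and $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,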



-7Vlog{l-|.[exp{|p)-l]} 



exp(— ar) 



{r)da 



< V 



< V 



-iVlog{l-2[exp{i^)_l]} 



7rexp{-ar)('')rf" 

+ (l + f)[X(i.,7i) + log(f)]. 
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We can then remember that 



Vre^p(-a3._^oR)(^--2,0^)c^" 



to conclude that, putting 



$$G_\eta(\alpha) = -N \log\Bigl\{1 - \frac{\alpha}{\eta}\Bigl[\exp\Bigl(\frac{\eta}{N}\Bigr) - 1\Bigr]\Bigr\} \geq \alpha, \qquad (1.16)$$
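As a quick numerical sanity check of the inequality $G_\eta(\alpha) \geq \alpha$ in (1.16), here is a minimal Python sketch. It assumes the reading $G_\eta(\alpha) = -N\log\{1 - \frac{\alpha}{\eta}[\exp(\frac{\eta}{N}) - 1]\}$ of the definition; the function name `g_eta` and the numerical values are illustrative choices, not taken from the text.

```python
import math

def g_eta(alpha: float, eta: float, n: int) -> float:
    """Evaluate G_eta(alpha) = -N log{1 - (alpha/eta)[exp(eta/N) - 1]}.

    The formula only makes sense when alpha < eta / (exp(eta/N) - 1).
    """
    inner = 1.0 - (alpha / eta) * math.expm1(eta / n)
    if inner <= 0.0:
        raise ValueError("need alpha < eta / (exp(eta/N) - 1)")
    return -n * math.log(inner)

n = 1000  # hypothetical sample size
for alpha, eta in [(1.0, 2.0), (5.0, 20.0), (50.0, 100.0)]:
    # G_eta(alpha) - alpha should be nonnegative in every case
    print(alpha, eta, g_eta(alpha, eta, n) - alpha)
```

The gap $G_\eta(\alpha) - \alpha$ stays small as long as $\eta / N$ and $\alpha / N$ are small, which is the regime where replacing $\alpha$ by $G_\eta(\alpha)$ costs little.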



and 



^("^) =^ ^^^j^J^^^j where h{m) = ? 7i"exp(-ar) (™> 'r)da, (1.17) 



7 



the divergence of u with respect to the local prior ji is bounded by 
[l-e(l + f)]0C(i/,7i) 



71"exp(-ar)(^)c^" 

+ X(i., /x) - %{-[!, ^) + ^(2 + ^) log(f ) 



/■Gr,(7) 
/ 7rexp(-ar)(^^)'^a 



+ 3C(i/, /n) 



+ log<^ n 



exp( / T^e^p(-otr){r)da 



+ ^(2 + ^)log(f) 

+ e(2 + ^)iog(f). 



We have proved 



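Theorem 1.23. For any positive constants $\beta$, $\gamma$ and $\eta$ such that
$\max\{\beta, \gamma\} < \eta\bigl[\exp\bigl(\tfrac{\eta}{N}\bigr) - 1\bigr]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any conditional posterior
distribution $\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,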






< 



1-e 1 + 



+ 



+ ^(2 + ^)log(f)} 



X{i^, d) 



[G,(7)-7 + G^(/?)-/?]r* + log 



G,(/?)G,(7) 



/?7 

+ ?(2 + ^)log(f)}, 



where the local prior is defined by equation (1.13), and the local
posterior and the function $G_\eta$ are defined by equation (1.17) above.

We can then use this theorem to give a local version of Theorem 1.20 (page
34). To get something pleasing to read, we can apply Theorem 1.23 with
constants 7' and rj chosen so that ^, = 1, Gri{P') = /? and 7' = A, 

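where $\beta$ and $\lambda$ are the constants appearing in Theorem 1.20. This gives

Theorem 1.24. For any positive real constants $\beta < \lambda$ and $\eta$ such that
$\lambda < \eta\bigl[\exp\bigl(\tfrac{\eta}{N}\bigr) - 1\bigr]^{-1}$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$, for any conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,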



X^^[up{R)] - /3^_p_[i^p{R)] <Bs{u,p), where 



Bz{y,p) = v 



G„(A) 



3 + 



G,7'(/3) 



'^c^pi-ar)ir)da 



X[u, p 



exp 



{3C(p,7r,,p(_,,)]} + (4+^^^ 



< u 



Gr,iX) 



log(f) 



dp 



'-^)x\u,p , , 

" ) L ' cxp[-(3- 

]}+ 4 + 



3 + ^ 



iog(!), 



and where the function $G_\eta$ is defined by equation (1.16).

A first remark: if we had the stamina to use Cauchy-Schwarz inequalities (or
more generally Hölder inequalities) on exponential moments instead of using
weighted union bounds on deviation inequalities, we could have replaced
$\log(\tfrac{2}{\epsilon})$ with $-\log(\epsilon)$ in the above inequalities.

We see that we have achieved the desired kind of localization of Theorem
1.20 (page 34), since the new empirical entropy term

cancels for a value of the posterior distribution $\nu$ on the index set which is
of the same form as the one minimizing the bound $B_1(\nu, \rho)$ of Theorem 1.20
(with a decreased constant, as could be expected). In a typical parametric
setting, we will have

$$\int_\beta^\lambda \pi_{\exp(-\alpha r)}(m, r)\, d\alpha \simeq (\lambda - \beta)\, r^\star(m) + \log\Bigl(\frac{\lambda}{\beta}\Bigr)\, d_e(m),$$

and therefore, if we choose for $\nu$ the Dirac mass at

$$\widehat{m} \in \arg\min_{m \in M}\; r^\star(m) + \frac{\log(\lambda/\beta)}{\lambda - \beta}\, d_e(m),$$

and $\rho(m, \cdot) = \pi_{\exp(-\lambda r)}(m, \cdot)$, we will get, in the case when the index set $M$
is countable,



B^{v, p) < max [G,(A) - G-\/3)] , (A - /?) 



3+ . 



r (m) + -j^de{m) 



+ (3 + s;^2)log E?|i-p 

ImeM 

X {(A-/3)[r*(m) -r*(m)] +log(^) [4(m) - 4(m)] } 

+ (4 + ^^)log(|). 

Therefore, as long as there are not too many of them, the models for which
the penalized minimum empirical risk
$r^\star(m) + \frac{\log(\lambda/\beta)}{\lambda - \beta}\, d_e(m)$ is far from optimal do not weigh strongly in this bound.

1.4. Relative bounds. The behaviour of the minimum of the empirical
process $\theta \mapsto r(\theta)$ is known to depend on the covariances between pairs
$[r(\theta), r(\theta')]$, $\theta, \theta' \in \Theta$. Accordingly, our previous study, based on the
analysis of the variance of $r(\theta)$ (or technically on some exponential moment
playing quite the same role), loses some accuracy in some circumstances
(namely when $\inf_\Theta R$ is not close enough to zero). In this subsection,
instead of bounding the expected risk $\rho(R)$, we are going to upper bound the
difference $\rho(R) - \inf_\Theta R$, and more generally $\rho(R) - R(\widetilde{\theta})$, where $\widetilde{\theta} \in \Theta$ is
some fixed parameter value. Finally, in the next subsection we will
analyze $\rho(R) - \pi_{\exp(-\beta R)}(R)$, which makes it possible to compare the expected error rate of
a posterior distribution $\rho$ with the error rate of a Gibbs prior distribution.
Thus relative bounds are not exactly of the same nature as previous ones:
although it is not possible to estimate $\rho(R)$ with an order of precision higher
than $(\rho(R)/N)^{1/2}$, it is still possible in some situations to reach a better
precision for $\rho(R) - \inf_\Theta R$, as we will see. The study of PAC-Bayesian relative
bounds stems from the second and third part of J.-Y. Audibert's dissertation.

We will suggest two different kinds of applications of these bounds. The
first, more obvious one is to upper bound $\rho(R) - \inf_\Theta R$ to get an idea of the
performance of the posterior distribution $\rho$.

The second application is to compare the classification model indexed
by $\Theta$ with a submodel indexed by one of its measurable subsets $\Theta_1 \subset \Theta$.
For this purpose we are going to compare $\rho(R)$, where $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$
is any posterior distribution, with $R(\widetilde{\theta})$, where $\widetilde{\theta} \in \Theta_1$ is some possibly
unobservable value of the parameter in the submodel defined by $\Theta_1$. We
will typically consider the case when $\widetilde{\theta} \in \arg\min_{\Theta_1} R$. In this special case, a
negative bound for $\rho(R) - R(\widetilde{\theta}) = \rho(R) - \inf_{\Theta_1} R$ indicates that it is definitely
worth using a randomized estimator $\rho$ supported by the larger parameter
set $\Theta$ instead of using only the classification model defined by the smaller
set $\Theta_1$.

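Basic inequalities. Relative bounds in this section are based on the
control of $r(\theta) - r(\widetilde{\theta})$, where $\theta, \widetilde{\theta} \in \Theta$. These differences are related to the
random variables

$$\psi_i(\theta, \widetilde{\theta}) = \sigma_i(\theta) - \sigma_i(\widetilde{\theta}) = \mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde{\theta}}(X_i) \neq Y_i\bigr].$$

Some supplementary technical difficulties, as compared to the previous
sections, come from the fact that $\psi_i(\theta, \widetilde{\theta})$ takes three values, whereas $\sigma_i(\theta)$
takes only two. Let $r'(\theta, \widetilde{\theta}) = r(\theta) - r(\widetilde{\theta})$ and $R'(\theta, \widetilde{\theta}) = R(\theta) - R(\widetilde{\theta})$. We
have as usual from independence that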






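$$\log\Bigl\{\mathbb{P}\bigl[\exp[-\lambda r'(\theta, \widetilde{\theta})]\bigr]\Bigr\} = \sum_{i=1}^N \log\Bigl\{\mathbb{P}\bigl[\exp\bigl(-\tfrac{\lambda}{N}\psi_i(\theta, \widetilde{\theta})\bigr)\bigr]\Bigr\} \leq N \log\Bigl\{\frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[\exp\bigl(-\tfrac{\lambda}{N}\psi_i(\theta, \widetilde{\theta})\bigr)\bigr]\Bigr\}.$$

Let $C_i$ be the distribution of $\psi_i(\theta, \widetilde{\theta})$ under $\mathbb{P}$ and let $\overline{C} = \frac{1}{N}\sum_{i=1}^N C_i \in \mathcal{M}_+^1(\{-1, 0, 1\})$. With these notations

$$\log\Bigl\{\mathbb{P}\bigl[\exp[-\lambda r'(\theta, \widetilde{\theta})]\bigr]\Bigr\} \leq N \log\Bigl\{\int \exp\bigl(-\tfrac{\lambda}{N}\psi\bigr)\, \overline{C}(d\psi)\Bigr\}. \qquad (1.18)$$

The right-hand side of this inequality is a function of $\overline{C}$. On the other
hand, $\overline{C}$, being a probability measure on a three point set, is defined by two
parameters, that we may take equal to $\int \psi\, \overline{C}(d\psi)$ and $\int \psi^2\, \overline{C}(d\psi)$. To this
purpose, let us introduce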



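$$M'(\theta, \widetilde{\theta}) = \int \psi^2\, \overline{C}(d\psi) = \overline{C}(+1) + \overline{C}(-1) = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[\psi_i^2(\theta, \widetilde{\theta})\bigr], \qquad \theta, \widetilde{\theta} \in \Theta.$$

It is a pseudo distance (meaning that it is symmetric and satisfies the triangle
inequality), since it can also be written as

$$M'(\theta, \widetilde{\theta}) = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\Bigl\{\bigl|\mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde{\theta}}(X_i) \neq Y_i\bigr]\bigr|\Bigr\}.$$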



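It is readily seen that

$$N \log\Bigl\{\int \exp\bigl(-\tfrac{\lambda}{N}\psi\bigr)\, \overline{C}(d\psi)\Bigr\} = -\lambda\, \Psi_{\frac{\lambda}{N}}\bigl[R'(\theta, \widetilde{\theta}), M'(\theta, \widetilde{\theta})\bigr],$$

where

$$\Psi_a(p, m) = -a^{-1} \log\Bigl[(1 - m) + \frac{m + p}{2}\exp(-a) + \frac{m - p}{2}\exp(a)\Bigr] = -a^{-1} \log\Bigl\{1 - \sinh(a)\bigl[p - m \tanh\bigl(\tfrac{a}{2}\bigr)\bigr]\Bigr\}.$$

Thus plugging this equality into inequality (1.18) we see that for any real
parameter $\lambda$,

$$\log\Bigl\{\mathbb{P}\bigl[\exp[-\lambda r'(\theta, \widetilde{\theta})]\bigr]\Bigr\} \leq -\lambda\, \Psi_{\frac{\lambda}{N}}\bigl[R'(\theta, \widetilde{\theta}), M'(\theta, \widetilde{\theta})\bigr].$$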
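The two expressions of $\Psi_a(p, m)$ (the three point mixture form and the hyperbolic form) can be compared numerically. The following Python sketch does this for a few admissible triples $(a, p, m)$ with $|p| \leq m \leq 1$; the function names and test values are illustrative assumptions introduced here, not notation from the text.

```python
import math

def psi_mixture(a: float, p: float, m: float) -> float:
    """Psi_a(p, m) from the three-point law with masses
    C(+1) = (m + p)/2, C(-1) = (m - p)/2, C(0) = 1 - m."""
    mass = (1 - m) + 0.5 * (m + p) * math.exp(-a) + 0.5 * (m - p) * math.exp(a)
    return -math.log(mass) / a

def psi_hyperbolic(a: float, p: float, m: float) -> float:
    """Psi_a(p, m) written with sinh and tanh."""
    return -math.log(1 - math.sinh(a) * (p - m * math.tanh(a / 2))) / a

for a, p, m in [(0.1, 0.05, 0.2), (0.5, -0.1, 0.3), (1.0, 0.2, 0.4)]:
    # the two columns should agree up to rounding errors
    print(a, p, m, psi_mixture(a, p, m), psi_hyperbolic(a, p, m))
```

The agreement rests on the identities $\cosh(a) - 1 = \sinh(a)\tanh(a/2)$ and $\frac{m+p}{2}e^{-a} + \frac{m-p}{2}e^{a} = m\cosh(a) - p\sinh(a)$.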
To make a link with previous works initiated by Mammen and Tsybakov,
we may consider the pseudo distance $D$ on $\Theta$ defined by equation (1.2).
This distance only depends on the distribution of
the patterns. It is often used to formulate margin assumptions (in the sense
of Mammen and Tsybakov). Here we are going to work rather with $M'$: as
it is dominated by $D$ in the sense that $M'(\theta, \widetilde{\theta}) \leq D(\theta, \widetilde{\theta})$, $\theta, \widetilde{\theta} \in \Theta$, with
equality in the important case of binary classification, hypotheses formulated
on $D$ induce hypotheses on $M'$, and working with $M'$ may only sharpen the
results when compared to working with $D$.

Using the same reasoning as in the previous section, we deduce

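Theorem 1.25. For any real parameter $\lambda$, any $\widetilde{\theta} \in \Theta$, any prior distribution $\pi \in \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\biggl\{\exp\biggl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\Bigl[\rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr]\bigr\} - \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr]\Bigr] - \mathcal{K}(\rho, \pi)\biggr]\biggr\} \leq 1.$$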



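We are now going to derive some variant of Theorem 1.25. In this theorem,
we obtain an inequality comparing one observed quantity $\rho[r'(\cdot, \widetilde{\theta})]$ with two
unobserved ones, $\rho[R'(\cdot, \widetilde{\theta})]$ and $\rho[M'(\cdot, \widetilde{\theta})]$ (because of the convexity of the
function $\lambda \Psi_{\frac{\lambda}{N}}$,

$$\lambda \rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr]\bigr\} \geq \lambda \Psi_{\frac{\lambda}{N}}\bigl\{\rho[R'(\cdot, \widetilde{\theta})], \rho[M'(\cdot, \widetilde{\theta})]\bigr\}.)$$

This may be inconvenient when looking for an empirical bound for $\rho[R'(\cdot, \widetilde{\theta})]$,
and we are going now to seek an inequality comparing $\rho[R'(\cdot, \widetilde{\theta})]$ with
empirical quantities only. This is possible through a change of variables in the
exponential inequality. Indeed, if we consider now random variables $\chi_i$
such that

$$1 - \tfrac{\lambda}{N}\psi_i = \exp\bigl(-\tfrac{\lambda}{N}\chi_i\bigr),$$

which is possible when $\tfrac{\lambda}{N} \in\, ]-1, 1[$ and leads to define

$$\chi_i = -\tfrac{N}{\lambda} \log\bigl(1 - \tfrac{\lambda}{N}\psi_i\bigr),$$

we obtain easily, following the same reasoning as previously,

$$\log\biggl\{\mathbb{P}\biggl\{\exp\biggl[\sum_{i=1}^N \log\bigl(1 - \tfrac{\lambda}{N}\psi_i\bigr)\biggr]\biggr\}\biggr\} = \sum_{i=1}^N \log\bigl[1 - \tfrac{\lambda}{N}\mathbb{P}(\psi_i)\bigr] \leq N \log\bigl[1 - \tfrac{\lambda}{N}R'(\theta, \widetilde{\theta})\bigr].$$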

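Let us replace for simplicity $\lambda/N$ with $\lambda$. Let us also introduce the random
pseudo distance

$$m'(\theta, \widetilde{\theta}) = \frac{1}{N}\sum_{i=1}^N \psi_i(\theta, \widetilde{\theta})^2 = \frac{1}{N}\sum_{i=1}^N \bigl|\mathbb{1}\bigl[f_\theta(X_i) \neq Y_i\bigr] - \mathbb{1}\bigl[f_{\widetilde{\theta}}(X_i) \neq Y_i\bigr]\bigr|. \qquad (1.19)$$

This is the empirical counterpart of $M'$, since $\mathbb{P}(m') = M'$. Let us notice
that

$$\frac{1}{N}\sum_{i=1}^N \log\bigl[1 - \lambda \psi_i(\theta, \widetilde{\theta})\bigr] = \frac{\log(1 - \lambda) - \log(1 + \lambda)}{2}\, r'(\theta, \widetilde{\theta}) + \frac{\log(1 - \lambda) + \log(1 + \lambda)}{2}\, m'(\theta, \widetilde{\theta})$$

$$= \frac{1}{2}\log\Bigl(\frac{1 - \lambda}{1 + \lambda}\Bigr)\, r'(\theta, \widetilde{\theta}) + \frac{1}{2}\log(1 - \lambda^2)\, m'(\theta, \widetilde{\theta}).$$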
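Since each $\psi_i$ takes values in $\{-1, 0, +1\}$, the empirical average of $\log(1 - \lambda\psi_i)$ is an exact linear combination of $r' = \frac{1}{N}\sum_i \psi_i$ and $m' = \frac{1}{N}\sum_i \psi_i^2$. The short Python sketch below checks this on a simulated sequence; the helper names and the simulated data are assumptions made for illustration only.

```python
import math
import random

def mean_log(psi, lam):
    """Left-hand side: (1/N) sum_i log(1 - lam * psi_i)."""
    return sum(math.log(1 - lam * v) for v in psi) / len(psi)

def linear_form(psi, lam):
    """Right-hand side expressed with r' and m'."""
    n = len(psi)
    r_prime = sum(psi) / n                 # empirical risk difference
    m_prime = sum(v * v for v in psi) / n  # empirical pseudo distance
    return (0.5 * math.log((1 - lam) / (1 + lam)) * r_prime
            + 0.5 * math.log(1 - lam ** 2) * m_prime)

rng = random.Random(0)
psi = [rng.choice([-1, 0, 1]) for _ in range(500)]  # hypothetical psi_i values
for lam in (-0.5, 0.1, 0.9):
    print(lam, mean_log(psi, lam), linear_form(psi, lam))
```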



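With these notations, we can conveniently write the previous inequality as

$$\mathbb{P}\biggl\{\exp\biggl[-N \log\bigl[1 - \lambda R'(\theta, \widetilde{\theta})\bigr] - \frac{N}{2}\log\Bigl(\frac{1 + \lambda}{1 - \lambda}\Bigr)\, r'(\theta, \widetilde{\theta}) + \frac{N}{2}\log(1 - \lambda^2)\, m'(\theta, \widetilde{\theta})\biggr]\biggr\} \leq 1.$$

Integrating with respect to a prior probability measure $\pi \in \mathcal{M}_+^1(\Theta)$, we
obtain

Theorem 1.26. For any real parameter $\lambda \in\, ]-1, 1[$, for any $\widetilde{\theta} \in \Theta$, for any
prior probability distribution $\pi \in \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\biggl\{\exp\biggl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \Bigl\{-N \rho\bigl\{\log\bigl[1 - \lambda R'(\cdot, \widetilde{\theta})\bigr]\bigr\} - \frac{N}{2}\log\Bigl(\frac{1 + \lambda}{1 - \lambda}\Bigr)\, \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] + \frac{N}{2}\log(1 - \lambda^2)\, \rho\bigl[m'(\cdot, \widetilde{\theta})\bigr] - \mathcal{K}(\rho, \pi)\Bigr\}\biggr]\biggr\} \leq 1.$$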



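1.4.2. Non random bounds. Let us first deduce a non random bound from
Theorem 1.25. This theorem can be conveniently taken advantage of by
throwing the non linearity into a localized prior, considering the prior
probability measure $\mu$ defined by

$$\frac{d\mu}{d\pi}(\theta) = \frac{\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\theta, \widetilde{\theta}), M'(\theta, \widetilde{\theta})\bigr] + \beta R'(\theta, \widetilde{\theta})\bigr\}}{\pi\bigl\{\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\bigr\}}.$$

Indeed, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathcal{K}(\rho, \mu) = \mathcal{K}(\rho, \pi) + \lambda \rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr]\bigr\} - \beta \rho\bigl[R'(\cdot, \widetilde{\theta})\bigr] + \log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}.$$

Plugging this into Theorem 1.25 and using the convexity of the exponential
function, we see that for any posterior probability distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\beta\, \mathbb{P}\bigl\{\rho[R'(\cdot, \widetilde{\theta})]\bigr\} \leq \lambda\, \mathbb{P}\bigl\{\rho[r'(\cdot, \widetilde{\theta})]\bigr\} + \mathbb{P}\bigl[\mathcal{K}(\rho, \pi)\bigr] + \log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}.$$

We can then recall that

$$\lambda \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] + \mathcal{K}(\rho, \pi) = \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] - \log\bigl\{\pi\bigl[\exp[-\lambda r'(\cdot, \widetilde{\theta})]\bigr]\bigr\},$$

and notice moreover that

$$-\mathbb{P}\Bigl\{\log\bigl\{\pi\bigl[\exp[-\lambda r'(\cdot, \widetilde{\theta})]\bigr]\bigr\}\Bigr\} \leq -\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot, \widetilde{\theta})]\bigr]\bigr\},$$

since $R' = \mathbb{P}(r')$ and $h \mapsto \log\{\pi[\exp(h)]\}$ is a convex functional. Putting
these two remarks together, we obtain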
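Theorem 1.27. For any real positive parameters $\beta$ and $\lambda$, for any prior
distribution $\pi \in \mathcal{M}_+^1(\Theta)$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\bigl\{\rho[R'(\cdot, \widetilde{\theta})]\bigr\} \leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho, \pi_{\exp(-\lambda r)})\bigr] + \frac{1}{\beta}\log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\} - \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot, \widetilde{\theta})]\bigr]\bigr\}$$

$$\leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho, \pi_{\exp(-\lambda r)})\bigr] + \frac{1}{\beta}\log\Bigl\{\pi\Bigl[\exp\bigl\{-\bigl[N \sinh\bigl(\tfrac{\lambda}{N}\bigr) - \beta\bigr] R'(\cdot, \widetilde{\theta}) + 2N \sinh\bigl(\tfrac{\lambda}{2N}\bigr)^2 M'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\} - \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot, \widetilde{\theta})]\bigr]\bigr\}.$$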



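It may be interesting to derive some more suggestive (but slightly weaker)
bound in the important case when $\Theta_1 = \Theta$ and $R(\widetilde{\theta}) = \inf_\Theta R$. In this case,
it is convenient to introduce the expected margin function

$$\varphi(x) = \sup_{\theta \in \Theta} M'(\theta, \widetilde{\theta}) - x R'(\theta, \widetilde{\theta}), \qquad x \in \mathbb{R}_+. \qquad (1.20)$$

We see that $\varphi$ is convex and nonnegative on $\mathbb{R}_+$. Using the bound $M'(\theta, \widetilde{\theta}) \leq
x R'(\theta, \widetilde{\theta}) + \varphi(x)$, we obtain

$$\mathbb{P}\bigl\{\rho[R'(\cdot, \widetilde{\theta})]\bigr\} \leq \frac{1}{\beta}\mathbb{P}\bigl[\mathcal{K}(\rho, \pi_{\exp(-\lambda r)})\bigr] + \frac{1}{\beta}\log\Bigl\{\pi\Bigl[\exp\bigl\{-\bigl(N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1 - x \tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr] - \beta\bigr) R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}$$

$$+ \frac{N}{\beta}\sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr)\varphi(x) - \frac{1}{\beta}\log\bigl\{\pi\bigl[\exp[-\lambda R'(\cdot, \widetilde{\theta})]\bigr]\bigr\}.$$

Let us make the change of variable $\gamma = N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1 - x \tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr] - \beta$ to
obtain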


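Corollary 1.28. For any real positive parameters $x$, $\gamma$ and $\lambda$ such that
$x < \tanh\bigl(\tfrac{\lambda}{2N}\bigr)^{-1}$ and $0 < \gamma < N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1 - x \tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]$,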
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+ iVsinh(A) tanh(2^)^(x) + P[3C(/),7re,p(_Ar))] 



Let us remark that these results, although well suited to study Mammen
and Tsybakov's margin assumptions, hold in the general case: introducing
the convex expected margin function $\varphi$ is a substitute for making hypotheses
about the relations between $R$ and $D$.

Using the fact that $R'(\theta, \widetilde{\theta}) \geq 0$, $\theta \in \Theta$, and that $\varphi(x) \geq 0$, $x \in \mathbb{R}_+$, we
can weaken and simplify even more the preceding corollary to get

Corollary 1.29. For any real parameters (3, A and x such that x > and 
< /3 < A — x^, for any posterior distribution /> : — >■ M5|_(B), 



Let us apply this bound under the margin assumption first considered by
Mammen and Tsybakov, which states that for some real positive
constant $c$ and some real exponent $\kappa \geq 1$,



V[p{R)\ <infi? 




$$R'(\theta, \widetilde{\theta}) \geq c\, D(\theta, \widetilde{\theta})^{\kappa}. \qquad (1.21)$$



In the case when $\kappa = 1$, then $\varphi(c^{-1}) = 0$, proving that




ff3 7rexp(-7i?) [R'{-,G)]dl 



< 



!(3 ^exp(-7fl) [R'{-,^)\dl 



Taking for example A = ^,/?=2=^,we obtain 



cJV 
If moreover the behaviour of the prior distribution $\pi$ is parametric, meaning
that $\pi_{\exp(-\beta R)}[R'(\cdot, \widetilde{\theta})] \leq \frac{d}{\beta}$ for some positive real constant $d$ linked with
the dimension of the classification model, then

r .r.M . 81og(2)(i . 5.55d 

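In the case when $\kappa > 1$,

$$\varphi(x) \leq (\kappa - 1)\, \kappa^{-\frac{\kappa}{\kappa - 1}}\, (c x)^{-\frac{1}{\kappa - 1}} = (1 - \kappa^{-1})\, (\kappa c x)^{-\frac{1}{\kappa - 1}},$$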
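The bound on $\varphi(x)$ for $\kappa > 1$ can be understood as follows: since $M' \leq D$ and $R' \geq c D^\kappa$, we have $\varphi(x) \leq \sup_{D \geq 0}\, (D - x c D^\kappa)$, and this supremum is reached at $D = (\kappa c x)^{-1/(\kappa - 1)}$. The Python sketch below compares the closed form with a brute force maximization on a grid; the function names, the grid and the numerical values of $c$, $\kappa$ and $x$ are illustrative assumptions.

```python
def margin_closed_form(x: float, c: float, kappa: float) -> float:
    """(1 - 1/kappa) * (kappa * c * x) ** (-1 / (kappa - 1)), for kappa > 1."""
    return (1 - 1 / kappa) * (kappa * c * x) ** (-1 / (kappa - 1))

def margin_grid_sup(x: float, c: float, kappa: float,
                    d_max: float = 1.0, steps: int = 10_000) -> float:
    """Brute-force sup over D in [0, d_max] of D - x * c * D**kappa."""
    return max(
        d - x * c * d ** kappa
        for d in (d_max * i / steps for i in range(steps + 1))
    )

for x, c, kappa in [(2.0, 1.0, 2.0), (4.0, 0.5, 3.0), (10.0, 1.0, 1.5)]:
    # the grid value never exceeds the closed form and is close to it
    print(x, c, kappa, margin_grid_sup(x, c, kappa), margin_closed_form(x, c, kappa))
```

In these examples the maximizer lies inside $[0, 1]$; when it falls outside the range of $D$, the closed form remains an upper bound but is no longer attained.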



thus P{7rexp(-Ar) [^'(•,^)]} 

Taking for instance /3 = I, x = and putting 6 = (1 — K~^){cK)~'i^ , we 
obtain 



P [7rexp(-Ar) (R)] - miR<- j^^^ 7rexp(-7R) e)]d^ + b J . 

In the parametric case when 7rexp(_-yR) ^ )] < ^, we get 

pKp(-..,(H)]-mfH<ii^M?M + 

Taking 

A = 2"^ [8 log(2)d] ^ (kc) at , 

we obtain 

We see that this formula coincides with the result for $\kappa = 1$. We can thus
reduce the two cases to a single one and state

Corollary 1.30. Let us assume that for some $\widetilde{\theta} \in \Theta$, some positive real
constant $c$, some real exponent $\kappa \geq 1$ and for any $\theta \in \Theta$, $R(\theta) \geq R(\widetilde{\theta}) +
c\, D(\theta, \widetilde{\theta})^{\kappa}$. Let us also assume that for some positive real constant $d$ and any
positive real parameter $\gamma$, $\pi_{\exp(-\gamma R)}(R) - R(\widetilde{\theta}) \leq \frac{d}{\gamma}$. Then
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P TT 



{ -2- 1 [8 log(2)d] (kc) Af^jfcr r } 



K-l 1 



{R) 



L exp 



< infi? + (2 - k-^){kc) 




Let us remark that the exponent of $N$ in this corollary is known to be the
minimax exponent under these assumptions: it is unimprovable, whatever
estimator is used in place of the Gibbs posterior shown here (at least in the
worst case compatible with the hypotheses). The interest of the corollary
is to show not only the minimax exponent in $N$, but also an explicit non
asymptotic bound with reasonable and simple constants. It is also clear that
we could have got slightly better constants if we had kept the full strength
of Theorem 1.27 (page 46) instead of using the weaker Corollary 1.29.

We will prove in the following empirical bounds showing how the constant
$\lambda$ can be estimated from the data instead of being chosen according to some
margin and complexity assumptions.

1.4.3. Unbiased empirical bounds. We are going to provide an empirical
counterpart for the expected margin function $\varphi$. It will appear in
empirical bounds having otherwise the same structure as the non random bound
we just proved. However, we will not try to compare the
behaviour of our proposed empirical margin function with the expected margin
function, since the margin function involves taking a supremum which is not
straightforward to handle.

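Let us start as in the previous subsection with the inequality

$$\beta\, \mathbb{P}\bigl\{\rho[R'(\cdot, \widetilde{\theta})]\bigr\} \leq \mathbb{P}\bigl\{\lambda \rho[r'(\cdot, \widetilde{\theta})] + \mathcal{K}(\rho, \pi)\bigr\} + \log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}.$$

We have already defined by equation (1.19) the empirical pseudo distance

$$m'(\theta, \widetilde{\theta}) = \frac{1}{N}\sum_{i=1}^N \psi_i(\theta, \widetilde{\theta})^2.$$

Recalling that $\mathbb{P}(m') = M'$ and $\mathbb{P}(r') = R'$, and using the convexity of $h \mapsto \log\{\pi[\exp(h)]\}$, we see that

$$\log\Bigl\{\pi\Bigl[\exp\bigl\{-\lambda \Psi_{\frac{\lambda}{N}}\bigl[R'(\cdot, \widetilde{\theta}), M'(\cdot, \widetilde{\theta})\bigr] + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}$$

$$\leq \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \sinh\bigl(\tfrac{\lambda}{N}\bigr) R'(\cdot, \widetilde{\theta}) + N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr) M'(\cdot, \widetilde{\theta}) + \beta R'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}$$

$$\leq \mathbb{P}\Bigl\{\log\Bigl\{\pi\Bigl[\exp\bigl\{-\bigl[N \sinh\bigl(\tfrac{\lambda}{N}\bigr) - \beta\bigr] r'(\cdot, \widetilde{\theta}) + N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr) m'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}\Bigr\}.$$

We may moreover remark that

$$\lambda \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] + \mathcal{K}(\rho, \pi) = \bigl[\beta - N \sinh\bigl(\tfrac{\lambda}{N}\bigr) + \lambda\bigr] \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] + \mathcal{K}\bigl[\rho, \pi_{\exp\{-[N \sinh(\frac{\lambda}{N}) - \beta] r\}}\bigr] - \log\Bigl\{\pi\Bigl[\exp\bigl\{-\bigl[N \sinh\bigl(\tfrac{\lambda}{N}\bigr) - \beta\bigr] r'(\cdot, \widetilde{\theta})\bigr\}\Bigr]\Bigr\}.$$

This completes the proof of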

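Theorem 1.31. For any positive real parameters $\beta$ and $\lambda$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\bigl\{\rho[R'(\cdot, \widetilde{\theta})]\bigr\} \leq \mathbb{P}\biggl\{\Bigl[1 - \frac{N \sinh(\frac{\lambda}{N}) - \lambda}{\beta}\Bigr] \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] + \frac{1}{\beta}\mathcal{K}\bigl[\rho, \pi_{\exp\{-[N \sinh(\frac{\lambda}{N}) - \beta] r\}}\bigr]$$

$$+ \frac{1}{\beta}\log\Bigl\{\pi_{\exp\{-[N \sinh(\frac{\lambda}{N}) - \beta] r\}}\Bigl[\exp\bigl[N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\tanh\bigl(\tfrac{\lambda}{2N}\bigr) m'(\cdot, \widetilde{\theta})\bigr]\Bigr]\Bigr\}\biggr\}.$$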



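Taking $\beta = \frac{N}{2}\sinh\bigl(\tfrac{\lambda}{N}\bigr)$, using the fact that $\sinh(a) \geq a$, $a \geq 0$, and expressing
$\tanh\bigl(\tfrac{a}{2}\bigr) = \sinh(a)^{-1}\bigl[\sqrt{1 + \sinh(a)^2} - 1\bigr]$ and $a = \log\bigl[\sqrt{1 + \sinh(a)^2} + \sinh(a)\bigr]$,
we deduce

Corollary 1.32. For any positive real constant $\beta$ and any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,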



P{p[i?'(-,e)]}<P< 



flog(AA^ + f)-i 



<i 







This theorem and its corollary are really analogous to Theorem 1.27 (page
46), and it could easily be proved that under Mammen and Tsybakov margin
assumptions, we obtain an upper bound of the same order as Corollary 1.30.
However, in order to obtain an empirical bound, we are going now
to take a supremum over all possible values of $\widetilde{\theta}$, that is over $\Theta_1$. Although
we believe that taking this supremum will not spoil the bound in cases when
overfitting remains under control, we will not try to investigate precisely if
and when this is actually true, and provide our empirical bound as such. Let
us only say that on qualitative grounds, the values of the margin function
quantify how steep the contrast function $R$ or its empirical counterpart $r$ is,
and that the definition of the empirical margin function is obtained by
substituting $\mathbb{P}$, the true sample distribution, with $\overline{\mathbb{P}} = \bigl(\frac{1}{N}\sum_{i=1}^N \delta_{(X_i, Y_i)}\bigr)^{\otimes N}$, the
empirical sample distribution, in the definition of the expected margin
function. Therefore, on qualitative grounds, it seems hopeless to presume
that $R$ is steep when $r$ is not, or in other words that a classification model
that would be inefficient at estimating a bootstrapped sample according
to our non random bound would be by some miracle efficient at estimating
the true sample distribution according to the same bound. To this extent,
we feel that our empirical bounds bring a satisfactory counterpart of our
non random bounds. Nevertheless, we will also produce estimators which can
be proved to be adaptive using PAC-Bayesian tools in the next subsection,
at the price of a more sophisticated construction involving comparisons
between a posterior distribution and a Gibbs prior distribution.

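Let us restrict now to the important case when $\widetilde{\theta} \in \arg\min_{\Theta_1} R$. To
obtain an observable bound, let $\widehat{\theta} \in \arg\min_{\theta \in \Theta} r(\theta)$ and let us introduce
the empirical margin functions

$$\overline{\varphi}(x) = \sup_{\theta \in \Theta} m'(\theta, \widehat{\theta}) - x\bigl[r(\theta) - r(\widehat{\theta})\bigr], \qquad x \in \mathbb{R}_+,$$

$$\widetilde{\varphi}(x) = \sup_{\theta \in \Theta_1} m'(\theta, \widehat{\theta}) - x\bigl[r(\theta) - r(\widehat{\theta})\bigr], \qquad x \in \mathbb{R}_+.$$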
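For a finite family of classification rules, the empirical pseudo distance and the two empirical margin functions can be computed directly from the table of individual errors. The Python sketch below is only an illustration under that finiteness assumption: the error table, the function names and the choice of submodel are hypothetical, and the supremum is a plain maximum over the listed rules.

```python
def empirical_risk(errors):
    """r(theta): fraction of misclassified examples (errors is a 0/1 list)."""
    return sum(errors) / len(errors)

def pseudo_distance(errors_a, errors_b):
    """m'(theta, theta'): fraction of examples on which the two error indicators differ."""
    return sum(abs(a - b) for a, b in zip(errors_a, errors_b)) / len(errors_a)

def empirical_margin(error_table, x, subset=None):
    """sup over theta (in the whole table, or in `subset`) of
    m'(theta, theta_hat) - x * [r(theta) - r(theta_hat)],
    where theta_hat minimizes the empirical risk over the whole table."""
    risks = [empirical_risk(e) for e in error_table]
    best = min(range(len(error_table)), key=risks.__getitem__)
    indices = range(len(error_table)) if subset is None else subset
    return max(
        pseudo_distance(error_table[t], error_table[best]) - x * (risks[t] - risks[best])
        for t in indices
    )

# Hypothetical error indicators of three rules on four examples.
table = [
    [0, 0, 0, 0],  # rule 0 makes no error on the sample
    [1, 0, 0, 0],
    [1, 1, 0, 1],
]
print(empirical_margin(table, 1.0))                 # whole family
print(empirical_margin(table, 2.0, subset=[1, 2]))  # submodel made of rules 1 and 2
```

In this toy table the best rule has zero empirical risk, so that $m'(\theta, \widehat{\theta}) = r(\theta)$ and the margin function at $x = 1$ vanishes.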

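Using the fact that $m'(\theta, \widetilde{\theta}) \leq m'(\theta, \widehat{\theta}) + m'(\widehat{\theta}, \widetilde{\theta})$, we get

Corollary 1.33. For any positive real parameters $\beta$ and $\lambda$, for any
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,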
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P[p(i?)] -infi?< P 



1 - 



Afsinh(A)_A 



[p{r)-ri9)] 



+ 



exp{-[JVsmh(A)-/3]r}J 



+ 



'exp{-[7Vsmh(A)_/3]r} 



+/3-^A^sinh(A)tanh(2^)(^ 



exp[iVsinh(;^) tanh(2^)m'(-,^)] | 
P / Afsmh(A)-A\ 



A^sinh(;^)tanh(2^ 



f3 



Taking $\beta = \frac{N}{2}\sinh\bigl(\tfrac{\lambda}{N}\bigr)$, we also obtain



P[p(i?)]-infi?<P f log(^l + ^g + f)-l [p(r)-rW] 



<i 



+ ^|3<^[p>71"exp(-/3r)] 

+ log 7rexp(-/3r){exp 



+ 



0< 



1 + 



4/32 



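Note that we could also use the upper bound $m'(\theta, \widehat{\theta}) \leq x\bigl[r(\theta) - r(\widehat{\theta})\bigr] + \overline{\varphi}(x)$
and put $\alpha = N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1 - x \tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr] - \beta$, to obtain

Corollary 1.34. For any non negative real parameters $x$, $\alpha$ and $\lambda$, such
that $\alpha < N \sinh\bigl(\tfrac{\lambda}{N}\bigr)\bigl[1 - x \tanh\bigl(\tfrac{\lambda}{2N}\bigr)\bigr]$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,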


F[p{R)] -MR 



< P< 



^ Arsinh(^)[l -a;tanh(2^)] - A 
A/"sinh(;^) [1 - a::tanh(2^)] - a, 

[Pj ^exp(— ar)] 

A/"sinh(^)[l - xtanh(2^)] - a 

iVsinh(A)tanh(^) 
A/"sinh(;^)[l - xtanh(2^)] - a 



[p{r)-r{e)] 



+ 



+ 
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(fix) + if 



( A — a 



Wsinh(A)tanh(2^; 



Let us notice that in the case when $\Theta_1 = \Theta$, the upper bound provided by
this corollary has the same general form as the upper bound provided by
Corollary 1.28 (page 46), with the sample distribution $\mathbb{P}$ replaced with the
empirical distribution of the sample $\overline{\mathbb{P}} = \bigl(\frac{1}{N}\sum_{i=1}^N \delta_{(X_i, Y_i)}\bigr)^{\otimes N}$. Therefore,
our empirical bound can be of a larger order of magnitude than our non
random bound only in the case when our non random bound applied to the
bootstrapped sample distribution $\overline{\mathbb{P}}$ would be of a larger order of magnitude
than when applied to the true sample distribution $\mathbb{P}$. In other words, we
can say that our empirical bound is close to our non random bound in every
situation where the bootstrapped sample distribution $\overline{\mathbb{P}}$ is not harder to
bound than the true sample distribution $\mathbb{P}$. Although this does not prove
that our empirical bound is always of the same order as our non random
bound, this is a good qualitative hint that this will be the case in most
practical situations of interest, since in situations of "underfitting", if they
exist, it is likely that the choice of the classification model is inappropriate
to the data and should be modified.

Another reassuring remark is that the empirical margin functions $\overline{\varphi}$ and
$\widetilde{\varphi}$ behave well in the case when $\inf_\Theta r = 0$. Indeed in this case $m'(\theta, \widehat{\theta}) =
r'(\theta, \widehat{\theta}) = r(\theta)$, $\theta \in \Theta$, and thus $\overline{\varphi}(1) = \widetilde{\varphi}(1) = 0$, and
$\widetilde{\varphi}(x) \leq -(x - 1) \inf_{\Theta_1} r$, $x \geq 1$.
This shows that we recover in this case the same accuracy as with non
relative local empirical bounds. Thus the bound of Corollary 1.34 does not
collapse in presence of massive overfitting in the larger model, causing $r(\widehat{\theta}) = 0$,
which is another hint that this may be an accurate bound in many situations.

1.4.4. Relative empirical deviation bounds. It is natural to make use of
Theorem 1.26 (page 44) to obtain empirical deviation bounds, since this
theorem provides an empirical variance term.

Theorem 1.26 is written in a way which exploits the fact that $\psi_i$ takes
only the three values $-1$, $0$ and $+1$. However, it will be more convenient for
the following computations to use it in its more general form, which only
makes use of the fact that $\psi_i \in [-1, 1]$. With notations to be explained
hereafter, it can indeed also be written as

$$\mathbb{P}\biggl\{\exp\biggl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \Bigl\{-N \rho\bigl\{\log\bigl[1 - \lambda P(\psi)\bigr]\bigr\} + N \rho\bigl\{\overline{P}\bigl[\log(1 - \lambda \psi)\bigr]\bigr\} - \mathcal{K}(\rho, \pi)\Bigr\}\biggr]\biggr\} \leq 1. \qquad (1.22)$$



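We have used the following notations in this inequality. We have put

$$\overline{P} = \frac{1}{N}\sum_{i=1}^N \delta_{(X_i, Y_i)},$$

so that $\overline{P}$ is our notation for the empirical distribution of the process
$(X_i, Y_i)_{i=1}^N$. Moreover we have also used

$$P = \frac{1}{N}\sum_{i=1}^N P_i,$$

where it should be remembered that the joint distribution of the process
$(X_i, Y_i)_{i=1}^N$ is $\mathbb{P} = \bigotimes_{i=1}^N P_i$. We have considered $\psi(\theta, \widetilde{\theta})$ as a function defined
on $\mathcal{X} \times \mathcal{Y}$, as

$$\psi(\theta, \widetilde{\theta})(x, y) = \mathbb{1}\bigl[y \neq f_\theta(x)\bigr] - \mathbb{1}\bigl[y \neq f_{\widetilde{\theta}}(x)\bigr], \qquad (x, y) \in \mathcal{X} \times \mathcal{Y},$$

so that it should be understood that

$$P(\psi) = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\bigl[\psi_i(\theta, \widetilde{\theta})\bigr] = \frac{1}{N}\sum_{i=1}^N \mathbb{P}\Bigl\{\mathbb{1}\bigl[Y_i \neq f_\theta(X_i)\bigr] - \mathbb{1}\bigl[Y_i \neq f_{\widetilde{\theta}}(X_i)\bigr]\Bigr\} = R'(\theta, \widetilde{\theta}).$$

In the same way

$$\overline{P}\bigl[\log(1 - \lambda \psi)\bigr] = \frac{1}{N}\sum_{i=1}^N \log\bigl[1 - \lambda \psi_i(\theta, \widetilde{\theta})\bigr].$$

Moreover integration with respect to $\rho$ bears on the index $\theta$, so that

$$\rho\bigl\{\log\bigl[1 - \lambda P(\psi)\bigr]\bigr\} = \int_{\theta \in \Theta} \log\Bigl\{1 - \frac{\lambda}{N}\sum_{i=1}^N \mathbb{P}\bigl[\psi_i(\theta, \widetilde{\theta})\bigr]\Bigr\}\, \rho(d\theta),$$

$$\rho\bigl\{\overline{P}\bigl[\log(1 - \lambda \psi)\bigr]\bigr\} = \int_{\theta \in \Theta} \Bigl\{\frac{1}{N}\sum_{i=1}^N \log\bigl[1 - \lambda \psi_i(\theta, \widetilde{\theta})\bigr]\Bigr\}\, \rho(d\theta).$$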

We have chosen concise notations, as we did throughout these notes, in 
order to make the computations easier to follow. 
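To get an alternate version of empirical relative deviation bounds, we
need to find some convenient way to localize the choice of the prior
distribution $\pi$ in equation (1.22). Here we propose to replace $\pi$ with
$\mu = \pi_{\exp\{-N \log[1 + \beta P(\psi)]\}}$, which can also be written $\pi_{\exp\{-N \log[1 + \beta R'(\cdot, \widetilde{\theta})]\}}$.
Indeed we see that

$$\mathcal{K}(\rho, \mu) = N \rho\bigl\{\log\bigl[1 + \beta P(\psi)\bigr]\bigr\} + \mathcal{K}(\rho, \pi) + \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \log\bigl[1 + \beta P(\psi)\bigr]\bigr\}\Bigr]\Bigr\}.$$

Moreover, we deduce from our deviation inequality applied to $-\psi$ that (as
long as $\beta > -1$),

$$\mathbb{P}\Bigl\{\exp\Bigl[N \mu\bigl\{\overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\} - N \mu\bigl\{\log\bigl[1 + \beta P(\psi)\bigr]\bigr\}\Bigr]\Bigr\} \leq 1.$$

Thus

$$\mathbb{P}\Bigl\{\exp\Bigl[\log\Bigl\{\pi\Bigl[\exp\bigl\{-N \log\bigl[1 + \beta P(\psi)\bigr]\bigr\}\Bigr]\Bigr\} - \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\}\Bigr]\Bigr\}\Bigr]\Bigr\}$$

$$\leq \mathbb{P}\Bigl\{\exp\Bigl[-N \mu\bigl\{\log\bigl[1 + \beta P(\psi)\bigr]\bigr\} - \mathcal{K}(\mu, \pi) + N \mu\bigl\{\overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\} + \mathcal{K}(\mu, \pi)\Bigr]\Bigr\} \leq 1.$$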



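This can be used to handle $\mathcal{K}(\rho, \mu)$, making use of the Cauchy-Schwarz
inequality as follows

$$\mathbb{P}\biggl\{\exp\biggl[\frac{1}{2}\sup_{\rho}\Bigl(-N \log\bigl\{\bigl(1 - \lambda \rho[P(\psi)]\bigr)\bigl(1 + \beta \rho[P(\psi)]\bigr)\bigr\} + N \rho\bigl\{\overline{P}\bigl[\log(1 - \lambda \psi)\bigr]\bigr\} - \mathcal{K}(\rho, \pi) - \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\}\Bigr]\Bigr\}\Bigr)\biggr]\biggr\}$$

$$\leq \mathbb{P}\biggl\{\exp\biggl[\sup_{\rho}\Bigl(-N \log\bigl(1 - \lambda \rho[P(\psi)]\bigr) + N \rho\bigl\{\overline{P}\bigl[\log(1 - \lambda \psi)\bigr]\bigr\} - \mathcal{K}(\rho, \mu)\Bigr)\biggr]\biggr\}^{1/2}$$

$$\times\; \mathbb{P}\biggl\{\exp\biggl[\log\Bigl\{\pi\Bigl[\exp\bigl\{-N \log\bigl[1 + \beta P(\psi)\bigr]\bigr\}\Bigr]\Bigr\} - \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\}\Bigr]\Bigr\}\biggr]\biggr\}^{1/2} \leq 1.$$

This implies that with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$-N \log\bigl\{\bigl(1 - \lambda \rho[P(\psi)]\bigr)\bigl(1 + \beta \rho[P(\psi)]\bigr)\bigr\} \leq -N \rho\bigl\{\overline{P}\bigl[\log(1 - \lambda \psi)\bigr]\bigr\} + \mathcal{K}(\rho, \pi) + \log\Bigl\{\pi\Bigl[\exp\bigl\{-N \overline{P}\bigl[\log(1 + \beta \psi)\bigr]\bigr\}\Bigr]\Bigr\} - 2 \log(\epsilon).$$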
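It is now convenient to remember that

$$\overline{P}\bigl[\log(1 - \lambda \psi)\bigr] = \frac{1}{2}\log\Bigl(\frac{1 - \lambda}{1 + \lambda}\Bigr)\, r'(\theta, \widetilde{\theta}) + \frac{1}{2}\log(1 - \lambda^2)\, m'(\theta, \widetilde{\theta}).$$

We thus can write the previous inequality as

$$-N \log\bigl\{\bigl(1 - \lambda \rho[R'(\cdot, \widetilde{\theta})]\bigr)\bigl(1 + \beta \rho[R'(\cdot, \widetilde{\theta})]\bigr)\bigr\} \leq \frac{N}{2}\log\Bigl(\frac{1 + \lambda}{1 - \lambda}\Bigr)\, \rho\bigl[r'(\cdot, \widetilde{\theta})\bigr] - \frac{N}{2}\log(1 - \lambda^2)\, \rho\bigl[m'(\cdot, \widetilde{\theta})\bigr] + \mathcal{K}(\rho, \pi)$$

$$+ \log\Bigl\{\pi\Bigl[\exp\Bigl\{-\frac{N}{2}\log\Bigl(\frac{1 + \beta}{1 - \beta}\Bigr)\, r'(\cdot, \widetilde{\theta}) - \frac{N}{2}\log(1 - \beta^2)\, m'(\cdot, \widetilde{\theta})\Bigr\}\Bigr]\Bigr\} - 2 \log(\epsilon).$$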

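Let us assume now that $\widetilde{\theta} \in \arg\min_{\Theta_1} R$. Let us introduce $\widehat{\theta} \in \arg\min_\Theta r$.
Decomposing $r'(\theta, \widetilde{\theta}) = r'(\theta, \widehat{\theta}) + r'(\widehat{\theta}, \widetilde{\theta})$ and considering that

$$m'(\theta, \widetilde{\theta}) \leq m'(\theta, \widehat{\theta}) + m'(\widehat{\theta}, \widetilde{\theta}),$$

we see that with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$-N \log\bigl\{\bigl(1 - \lambda \rho[R'(\cdot, \widetilde{\theta})]\bigr)\bigl(1 + \beta \rho[R'(\cdot, \widetilde{\theta})]\bigr)\bigr\} \leq \frac{N}{2}\log\Bigl(\frac{1 + \lambda}{1 - \lambda}\Bigr)\, \rho\bigl[r'(\cdot, \widehat{\theta})\bigr] - \frac{N}{2}\log(1 - \lambda^2)\, \rho\bigl[m'(\cdot, \widehat{\theta})\bigr] + \mathcal{K}(\rho, \pi)$$

$$+ \log\Bigl\{\pi\Bigl[\exp\Bigl\{-\frac{N}{2}\log\Bigl(\frac{1 + \beta}{1 - \beta}\Bigr)\, r'(\cdot, \widehat{\theta}) - \frac{N}{2}\log(1 - \beta^2)\, m'(\cdot, \widehat{\theta})\Bigr\}\Bigr]\Bigr\}$$

$$+ \frac{N}{2}\log\Bigl[\frac{(1 + \lambda)(1 - \beta)}{(1 - \lambda)(1 + \beta)}\Bigr]\bigl[r(\widehat{\theta}) - r(\widetilde{\theta})\bigr] - \frac{N}{2}\log\bigl[(1 - \lambda^2)(1 - \beta^2)\bigr]\, m'(\widehat{\theta}, \widetilde{\theta}) - 2 \log(\epsilon).$$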



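Let us now define for simplicity the posterior $\nu : \Omega \to \mathcal{M}_+^1(\Theta)$ by the
identity

$$\frac{d\nu}{d\pi}(\theta) = \frac{\exp\bigl\{-\frac{N}{2}\log\bigl(\frac{1 + \lambda}{1 - \lambda}\bigr)\, r'(\theta, \widehat{\theta}) + \frac{N}{2}\log(1 - \lambda^2)\, m'(\theta, \widehat{\theta})\bigr\}}{\pi\bigl[\exp\bigl\{-\frac{N}{2}\log\bigl(\frac{1 + \lambda}{1 - \lambda}\bigr)\, r'(\cdot, \widehat{\theta}) + \frac{N}{2}\log(1 - \lambda^2)\, m'(\cdot, \widehat{\theta})\bigr\}\bigr]}.$$

Let us also introduce the random bound

$$B = \frac{1}{N}\log\biggl\{\nu\biggl[\exp\biggl[\frac{N}{2}\log\Bigl[\frac{(1 + \lambda)(1 - \beta)}{(1 - \lambda)(1 + \beta)}\Bigr]\, r'(\cdot, \widehat{\theta}) - \frac{N}{2}\log\bigl[(1 - \lambda^2)(1 - \beta^2)\bigr]\, m'(\cdot, \widehat{\theta})\biggr]\biggr]\biggr\}$$

$$+ \sup_{\theta \in \Theta_1} \biggl\{\frac{1}{2}\log\Bigl[\frac{(1 - \lambda)(1 + \beta)}{(1 + \lambda)(1 - \beta)}\Bigr]\, r'(\theta, \widehat{\theta}) - \frac{1}{2}\log\bigl[(1 - \lambda^2)(1 - \beta^2)\bigr]\, m'(\theta, \widehat{\theta})\biggr\} - \frac{2}{N}\log(\epsilon).$$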



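Theorem 1.35. Using the above notations, for any real constants $0 < \beta <
\lambda < 1$, for any prior distribution $\pi \in \mathcal{M}_+^1(\Theta)$, for any subset $\Theta_1 \subset \Theta$, with
$\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$-\log\Bigl\{\bigl(1 - \lambda\bigl[\rho(R) - \inf_{\Theta_1} R\bigr]\bigr)\bigl(1 + \beta\bigl[\rho(R) - \inf_{\Theta_1} R\bigr]\bigr)\Bigr\} \leq \frac{\mathcal{K}(\rho, \nu)}{N} + B.$$

Therefore,

$$\rho(R) - \inf_{\Theta_1} R \leq \frac{\lambda - \beta}{2 \lambda \beta}\Biggl(\sqrt{1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2}\Bigl\{1 - \exp\Bigl[-B - \frac{\mathcal{K}(\rho, \nu)}{N}\Bigr]\Bigr\}} - 1\Biggr) \leq \frac{1}{\lambda - \beta}\Bigl[B + \frac{\mathcal{K}(\rho, \nu)}{N}\Bigr].$$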

Let us define the posterior u by the identity 



exp[^-f log 




r'(^,^)-f log(l-/32)m'(e,^) 




7r|exp 


-flog 


( 1+/3 


)^/(.,^)_|log(l-/32)^'(.,^)" 


} 



It is useful to remark that 



^log<l^ 



exp 



N 

ylog 



(l-A)(l + /?)r^'^^ 
AT 



-log[(l-A2)(l-/32)]m'(-,0) 



<4ilogfii±Ml^V'(-^) 
- \2 ^l(l-A)(l + /3)>'^' ^ 

-^log[(l-A2)(l-/32)]m'(-,^)}. 

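Let us introduce as previously $\overline{\varphi}(x) = \sup_{\theta \in \Theta} m'(\theta, \widehat{\theta}) - x\, r'(\theta, \widehat{\theta})$, $x \in \mathbb{R}_+$.
Let us moreover consider $\widetilde{\varphi}(x) = \sup_{\theta \in \Theta_1} m'(\theta, \widehat{\theta}) - x\, r'(\theta, \widehat{\theta})$, $x \in \mathbb{R}_+$. These
functions can be used to produce a result which is slightly weaker, but maybe
easier to read and understand. Indeed, going back a few steps, we see
that, for any $x \in \mathbb{R}_+$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\rho$,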
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N log{ (l - Xp [R'i; 0)] )(i + Pp [R'{; 6)] ) } 
(1 + A) 



<ylog 



L(l-A)(l-A2)^ 



p[r\;9)] 



^log[(l-A2)(l-^2)]^(x)+aC(p,7r) 



+ log|7r expj 



N 
' 2 



log 



(1+/3) 



(l-/3)(l-/3^)- 



_ log[(i-A2)(l-/32)]^ 



log 



(l+A)(l-/3) 



■log[(l-A2)(l-^2)] 

-21og(e) 



(l+A) 



flog[ 



(1-A)(1-A^)^ 
(l-/3)(l-/3^)^ 



^exp(-ar) [r'{-,d)]da 



^log[(l-A2)(l-/32)] 



log 



(l+A)(l-/3) 
(l-A){l+/3) 



-log[(l-A2)(l-/32)] 



Theorem 1.36. With the previous notations, for any real constants $0 <
\beta < \lambda < 1$, for any positive real constant $x$, for any prior probability
distribution $\pi \in \mathcal{M}_+^1(\Theta)$, for any subset $\Theta_1 \subset \Theta$, with $\mathbb{P}$ probability at least $1 - \epsilon$,
for any posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$, putting



B{p) 



N{X 



N 



log[ 
log[ 



(l+A) 



(1_A){1-A'')^ 



7rexp(-Qr) [r'{-,9)\da 



3C(p,7r, 



+ 



exp{--f log[ 



(l+A) ■ 
(1-A){1-A^)^ ■ 



)-21og(e) 



7V(A-^) 



< 



2(A-/3) 
1 



log[(l-A^) (1-/32)] 
log 



log 



(l+A)(l-/3) 
(l-A)(l+/3) 



.log[(l-A2)(l-/32)] 



iV(A-/3) 



Cie log 



(1+A) 
(l-A)(l-A^)-_ 



,^Ogl,(l-/3)(l-/3^)- 
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+ 



3^(P>\xp{-i^iosf (1+^), ]r]^ - 21og(e) 



2(A-/3) 



log[(l-A2)(l-/32 



iV(A-/3) 



log 



(l+A)(l-/3) 



■log[(l-A2)(l-/32)] 



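the following bounds hold true:

$$\rho(R) - \inf_{\Theta_1} R \leq \frac{\lambda - \beta}{2 \lambda \beta}\Biggl(\sqrt{1 + \frac{4 \lambda \beta}{(\lambda - \beta)^2}\Bigl\{1 - \exp\bigl[-(\lambda - \beta) B(\rho)\bigr]\Bigr\}} - 1\Biggr) \leq B(\rho).$$

Let us remark that this alternative way of handling relative deviation bounds
made it possible to carry on with non linear bounds up to the final result.
(For instance, if $\lambda = 0.5$, $\beta = 0.2$ and $B(\rho) = 0.1$, the non linear bound
gives $\rho(R) - \inf_{\Theta_1} R \leq 0.096$.)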
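The numerical example can be reproduced with a few lines of Python. The sketch below assumes that the non linear bound has the form $\frac{\lambda - \beta}{2\lambda\beta}\bigl(\sqrt{1 + \frac{4\lambda\beta}{(\lambda - \beta)^2}\{1 - \exp[-(\lambda - \beta)B(\rho)]\}} - 1\bigr)$, which is the positive root of $\lambda\beta x^2 + (\lambda - \beta)x = 1 - \exp[-(\lambda - \beta)B(\rho)]$; the function name is a hypothetical helper introduced for this illustration.

```python
import math

def nonlinear_bound(lam: float, beta: float, b: float) -> float:
    """Largest x >= 0 such that (1 - lam*x)(1 + beta*x) >= exp(-(lam - beta)*b),
    for 0 < beta < lam < 1."""
    gap = lam - beta
    u = 4 * lam * beta / gap ** 2 * (1 - math.exp(-gap * b))
    return gap / (2 * lam * beta) * (math.sqrt(1 + u) - 1)

value = nonlinear_bound(0.5, 0.2, 0.1)
print(value)          # roughly 0.0955, to be compared with the linear bound B = 0.1
print(value <= 0.1)   # the non linear bound improves on the linear one
```

With these values the improvement over the linear bound $B(\rho) = 0.1$ is modest, of the order of five percent; it becomes more visible when $(\lambda - \beta) B(\rho)$ is not small.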



1.5. Bounds relative to a Gibbs distribution. The empirical bounds
of the previous section involve taking suprema in $\theta \in \Theta$, and replacing the
margin function $\varphi$ by some empirical counterparts $\overline{\varphi}$ or $\widetilde{\varphi}$, which may prove
unsafe when using very complex classification models. Moreover, they are
not easy to analyze with PAC-Bayesian tools. To remedy these weaknesses,
we are going now to propose another type of relative bounds. We will first
explain how to compare the expected error rate $\rho(R)$ of any posterior
distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ with $\pi_{\exp(-\beta R)}(R)$, the expected risk of a Gibbs prior
distribution. We will then show how to analyze the behaviour of this bound.
This will provide an estimator proven to reach adaptively the best possible
asymptotic behaviour of the error rate under Mammen and Tsybakov
margin assumptions and parametric complexity assumptions.

Then, we will provide an empirical bound for the Kullback divergence
$\mathcal{K}(\rho, \pi_{\exp(-\beta R)})$ of any posterior distribution with respect to a Gibbs prior,
making use of relative deviation inequalities.

To tackle the question of model selection, we will estimate the relative
performance of one posterior distribution with respect to another, which is
useful when the two posteriors are supported by different models.

Finally, we will propose a more integrated approach to model selection,
showing how to build a two step localization strategy, in which the
performance of the posterior distribution to be analyzed is compared with
some two step Gibbs prior.

1.5.1. Comparing a posterior distribution with a Gibbs prior. Similarly to 
Theorem 11.261 we can prove that for any prior distribution vr € Jyl\{Q), 



P< TT (g) 7r<^ 



exp 



-iVlog(l-AiiO 



N 

y 



< 1. (1.23) 



Replacing tt with vrexp(-/3R) and considering the posterior distribution p (g) 
7rgxp{-/3_R) , provides a starting point in the comparison of p with Trexp{-/3R) 'i 
we can indeed state with P probability at least 1 — e that 



7Vlog<^ 1 - A 



p{R) - vr, 



exp(-/3R) 



< — log ( T 1 [Pi^) 



1 - A 



vr, 



exp{-/3_R) 



(r)] 



N 



— log(l-A )p(8)vrexp{_/3i?)(m') 

+ 3^[/5,7rexp(-^fi)] -log(e). (1.24) 

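Using the parameter $\gamma = \frac{N}{2} \log\bigl(\frac{1+\lambda}{1-\lambda}\bigr)$, so that $\lambda = \tanh\bigl(\frac{\gamma}{N}\bigr)$ and
$-\frac{N}{2} \log(1 - \lambda^2) = N \log\bigl[\cosh\bigl(\frac{\gamma}{N}\bigr)\bigr]$, and noticing that

$$\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] = \beta \bigl[ \rho(R) - \pi_{\exp(-\beta R)}(R) \bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl[\pi_{\exp(-\beta R)}, \pi\bigr], \tag{1.25}$$

brings us a step further in the proper handling of the entropy term:

$$\begin{aligned}
-N \log\Bigl\{ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[ \rho(R) - \pi_{\exp(-\beta R)}(R) \bigr] \Bigr\} & - \beta \bigl[ \rho(R) - \pi_{\exp(-\beta R)}(R) \bigr] \\
\le{} & \gamma \bigl[ \rho(r) - \pi_{\exp(-\beta R)}(r) \bigr] + N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\, \rho \otimes \pi_{\exp(-\beta R)}(m') \\
& + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl[\pi_{\exp(-\beta R)}, \pi\bigr] - \log(\epsilon).
\end{aligned} \tag{1.26}$$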
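The change of parameter from $\lambda$ to $\gamma$ used above can be checked numerically. The following is only a small sanity-check sketch; the function names `gamma_from_lambda` and `lambda_from_gamma` are illustrative and do not come from the text.

```python
import math


def gamma_from_lambda(lam: float, n: int) -> float:
    """gamma = (N/2) * log((1 + lambda) / (1 - lambda)), for 0 <= lambda < 1."""
    return 0.5 * n * math.log((1.0 + lam) / (1.0 - lam))


def lambda_from_gamma(gamma: float, n: int) -> float:
    """Inverse map: lambda = tanh(gamma / N)."""
    return math.tanh(gamma / n)


n = 1000
for lam in (0.01, 0.1, 0.5, 0.9):
    gamma = gamma_from_lambda(lam, n)
    # Round trip lambda -> gamma -> lambda.
    assert math.isclose(lambda_from_gamma(gamma, n), lam, rel_tol=1e-9)
    # Identity -(N/2) log(1 - lambda^2) = N log cosh(gamma / N).
    lhs = -0.5 * n * math.log(1.0 - lam * lam)
    rhs = n * math.log(math.cosh(gamma / n))
    assert math.isclose(lhs, rhs, rel_tol=1e-9)
```

Both identities follow from $\tanh^{-1}(\lambda) = \frac{1}{2}\log\frac{1+\lambda}{1-\lambda}$ and $\cosh^2 = (1 - \tanh^2)^{-1}$.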

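We can then decompose in the right-hand side $\gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr]$ into
$(\gamma - \lambda) \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \lambda \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr]$, where $\lambda$ now denotes
an arbitrary real constant (and no longer $\tanh(\frac{\gamma}{N})$), and use the fact that

$$\begin{aligned}
\lambda \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] & + N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\, \rho \otimes \pi_{\exp(-\beta R)}(m') + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl[\pi_{\exp(-\beta R)}, \pi\bigr] \\
\le{} & \lambda \rho(r) + \mathcal{K}(\rho, \pi) + \log\Bigl\{ \pi \Bigl[ \exp\bigl\{ -\lambda r + N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\
={} & \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] + \log\Bigl\{ \pi_{\exp(-\lambda r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\},
\end{aligned}$$

to get rid of the appearance of the unobserved Gibbs prior $\pi_{\exp(-\beta R)}$ in most
places of the right-hand side of our inequality, leading to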

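Theorem 1.37. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least
$1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$, for any real constant
$\lambda$,

$$\begin{aligned}
\bigl[N \tanh\bigl(\tfrac{\gamma}{N}\bigr) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr]
\le{} & -N \log\Bigl\{ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \Bigr\} \\
& - \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \\
\le{} & (\gamma - \lambda) \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] \\
& + \log\Bigl\{ \pi_{\exp(-\lambda r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} - \log(\epsilon) \\
={} & \mathcal{K}\bigl[\rho, \pi_{\exp(-\gamma r)}\bigr] + \log\Bigl\{ \pi_{\exp(-\gamma r)} \Bigl[ \exp\bigl\{ (\gamma - \lambda) r + N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\
& - (\gamma - \lambda) \pi_{\exp(-\beta R)}(r) - \log(\epsilon).
\end{aligned}$$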



We would like to have a fully empirical upper bound even in the case when
$\lambda \neq \gamma$. This can be done by using the theorem twice. We will need a lemma.

Lemma 1.38. For any probability distribution $\pi \in \mathcal{M}^1_+(\Theta)$, for any bounded
measurable functions $g, h : \Theta \to \mathbb{R}$,

$$\pi_{\exp(-g)}(g) - \pi_{\exp(-h)}(g) \le \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h).$$



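Proof. Let us notice that

$$\begin{aligned}
0 \le \mathcal{K}\bigl(\pi_{\exp(-g)}, \pi_{\exp(-h)}\bigr)
&= \pi_{\exp(-g)}(h) + \log\bigl\{ \pi\bigl[\exp(-h)\bigr] \bigr\} + \mathcal{K}\bigl(\pi_{\exp(-g)}, \pi\bigr) \\
&= \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h) - \mathcal{K}\bigl(\pi_{\exp(-h)}, \pi\bigr) + \mathcal{K}\bigl(\pi_{\exp(-g)}, \pi\bigr) \\
&= \pi_{\exp(-g)}(h) - \pi_{\exp(-h)}(h) - \mathcal{K}\bigl(\pi_{\exp(-h)}, \pi\bigr) - \pi_{\exp(-g)}(g) - \log\bigl\{ \pi\bigl[\exp(-g)\bigr] \bigr\}.
\end{aligned}$$

Moreover

$$-\log\bigl\{ \pi\bigl[\exp(-g)\bigr] \bigr\} \le \pi_{\exp(-h)}(g) + \mathcal{K}\bigl(\pi_{\exp(-h)}, \pi\bigr),$$

which completes the proof. □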
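As an illustration, the inequality of Lemma 1.38 can be tested numerically on a finite parameter set. This is a sketch under assumptions that are not in the text: $\Theta$ is taken finite, the prior is drawn at random, and the helper names `gibbs` and `lemma_gap` are hypothetical.

```python
import numpy as np


def gibbs(prior: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Gibbs measure pi_{exp(-h)}: density proportional to exp(-h) w.r.t. prior."""
    w = prior * np.exp(-(h - h.min()))  # shift by min(h) for numerical stability
    return w / w.sum()


def lemma_gap(prior: np.ndarray, g: np.ndarray, h: np.ndarray) -> float:
    """Right-hand side minus left-hand side of the lemma; should be >= 0."""
    pg, ph = gibbs(prior, g), gibbs(prior, h)
    lhs = pg @ g - ph @ g
    rhs = pg @ h - ph @ h
    return float(rhs - lhs)


rng = np.random.default_rng(0)
gaps = []
for _ in range(1000):
    k = int(rng.integers(2, 20))          # size of the finite parameter set
    prior = rng.dirichlet(np.ones(k))     # random prior on k points
    g = rng.normal(scale=3.0, size=k)
    h = rng.normal(scale=3.0, size=k)
    gaps.append(lemma_gap(prior, g, h))
min_gap = min(gaps)  # expected to be nonnegative up to rounding errors
```

The gap computed here equals the symmetrized divergence $\mathcal{K}(\pi_{\exp(-g)}, \pi_{\exp(-h)}) + \mathcal{K}(\pi_{\exp(-h)}, \pi_{\exp(-g)})$, which is why it cannot be negative.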

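For any positive real constants $\beta$ and $\lambda$, we can then apply Theorem 1.37
to $\rho = \pi_{\exp(-\lambda r)}$, and use the inequality

$$\frac{\lambda}{\beta} \bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr] \le \pi_{\exp(-\lambda r)}(R) - \pi_{\exp(-\beta R)}(R) \tag{1.27}$$

provided by the previous lemma. We thus obtain with $\mathbb{P}$ probability at least
$1 - \epsilon$

$$\begin{aligned}
-N \log\Bigl\{ 1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) \frac{\lambda}{\beta} \bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr] \Bigr\}
& - \gamma \bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr] \\
\le{} & \log\Bigl\{ \pi_{\exp(-\lambda r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} - \log(\epsilon).
\end{aligned}$$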

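Let us introduce the convex function

$$F_{\gamma,\alpha}(x) = -N \log\bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) x\bigr] - \alpha x \ge \bigl[N \tanh\bigl(\tfrac{\gamma}{N}\bigr) - \alpha\bigr] x.$$

With $\mathbb{P}$ probability at least $1 - \epsilon$,

$$-\pi_{\exp(-\beta R)}(r) \le \inf_{\lambda \in \mathbb{R}_+^*} \biggl\{ -\pi_{\exp(-\lambda r)}(r) + \frac{\beta}{\lambda} F^{-1}_{\gamma, \frac{\beta\gamma}{\lambda}} \biggl[ \log\Bigl\{ \pi_{\exp(-\lambda r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} - \log(\epsilon) \biggr] \biggr\}.$$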



Since Theorem 1.37 holds uniformly for any posterior distribution $\rho$, we can
apply it again to some arbitrary posterior distribution $\rho$. We can moreover
make the result uniform in $\beta$ and $\gamma$ by considering some atomic measure
$\nu \in \mathcal{M}^1_+(\mathbb{R})$ on the real line and using a union bound. This leads to

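Theorem 1.39. For any atomic probability distribution on the positive
real line $\nu \in \mathcal{M}^1_+(\mathbb{R}_+)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$, for any positive real constants $\beta$ and $\gamma$,

$$\bigl[N \tanh\bigl(\tfrac{\gamma}{N}\bigr) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le F_{\gamma,\beta}\bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le B(\rho, \beta, \gamma),$$

where

$$\begin{aligned}
B(\rho, \beta, \gamma) ={} & \inf_{\substack{\lambda_1 \in \mathbb{R}_+,\, \lambda_1 \le \gamma \\ \lambda_2 \in \mathbb{R},\, \lambda_2 > \frac{\beta\gamma}{N} \tanh(\frac{\gamma}{N})^{-1}}} \biggl\{ \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda_1 r)}\bigr] + (\gamma - \lambda_1) \bigl[\rho(r) - \pi_{\exp(-\lambda_2 r)}(r)\bigr] \\
& + \log\Bigl\{ \pi_{\exp(-\lambda_1 r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} - \log\bigl[\epsilon \nu(\beta) \nu(\gamma)\bigr] \\
& + (\gamma - \lambda_1) \frac{\beta}{\lambda_2} F^{-1}_{\gamma, \frac{\beta\gamma}{\lambda_2}} \biggl[ \log\Bigl\{ \pi_{\exp(-\lambda_2 r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda_2 r)}(m') \bigr\} \Bigr] \Bigr\} - \log\bigl[\epsilon \nu(\beta) \nu(\gamma)\bigr] \biggr] \biggr\} \\
\le{} & \inf_{\substack{\lambda_1 \in \mathbb{R}_+,\, \lambda_1 \le \gamma \\ \lambda_2 \in \mathbb{R},\, \lambda_2 > \frac{\beta\gamma}{N} \tanh(\frac{\gamma}{N})^{-1}}} \biggl\{ \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda_1 r)}\bigr] + (\gamma - \lambda_1) \bigl[\rho(r) - \pi_{\exp(-\lambda_2 r)}(r)\bigr] \\
& + \log\Bigl\{ \pi_{\exp(-\lambda_1 r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\
& + \frac{1 - \frac{\lambda_1}{\gamma}}{\frac{N \lambda_2}{\beta\gamma} \tanh(\frac{\gamma}{N}) - 1} \log\Bigl\{ \pi_{\exp(-\lambda_2 r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda_2 r)}(m') \bigr\} \Bigr] \Bigr\} \\
& - \biggl\{ 1 + \frac{1 - \frac{\lambda_1}{\gamma}}{\frac{N \lambda_2}{\beta\gamma} \tanh(\frac{\gamma}{N}) - 1} \biggr\} \log\bigl[\epsilon \nu(\beta) \nu(\gamma)\bigr] \biggr\},
\end{aligned}$$

where we have written for short $\nu(\beta)$ and $\nu(\gamma)$ instead of $\nu(\{\beta\})$ and $\nu(\{\gamma\})$.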

Let us notice that $B(\rho, \beta, \gamma) = +\infty$ when $\nu(\beta) = 0$ or $\nu(\gamma) = 0$; the uniformity
in $\beta$ and $\gamma$ of the theorem therefore necessarily bears on a countable
number of values of these parameters. We can typically choose for $\nu$ distributions
such as the one used in Theorem 1.11 on page 21, namely we can
put for some positive real ratio $\alpha > 1$

1 



or alternatively, since we are interested in values of the parameters less than
$N$, we can prefer

$$\nu(\alpha^k) = \frac{\log(\alpha)}{\log(\alpha N)}, \qquad 0 \le k < \frac{\log(N)}{\log(\alpha)}.$$

We can also use such a coding distribution on dyadic numbers as the one
defined by equation (1.6).
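The geometric grid with uniform weights mentioned above is easy to build explicitly. The sketch below is only illustrative: the function names `geometric_grid` and `grid_weight` are assumptions, and the condition $k < \log(N)/\log(\alpha)$ is implemented through the equivalent test $\alpha^k < N$ to avoid rounding issues.

```python
import math


def geometric_grid(alpha: float, n: int) -> list[float]:
    """Grid {alpha**k : k in N, alpha**k < n}, i.e. 0 <= k < log(n)/log(alpha)."""
    grid = []
    value = 1.0
    while value < n:
        grid.append(value)
        value *= alpha
    return grid


def grid_weight(alpha: float, n: int) -> float:
    """Uniform weight log(alpha) / log(alpha * n) given to each grid point."""
    return math.log(alpha) / math.log(alpha * n)


grid = geometric_grid(2.0, 1000)
total_mass = len(grid) * grid_weight(2.0, 1000)
# The grid has at most log(alpha * n) / log(alpha) points, so the total mass is at most 1.
assert total_mass <= 1.0
```

With such weights, the union bound over the grid costs an additive term $-\log \nu(\beta) = \log[\log(\alpha N)/\log(\alpha)]$ for each parameter, which grows only like $\log\log(N)$.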

1.5.2. The effective temperature of a posterior distribution. Using the parametric
approximation $\pi_{\exp(-\alpha r)}(r) - \inf_\Theta r \simeq \frac{d_e}{\alpha}$, we get as an order of magnitude



5(^exp{-Air.),/3,7) < -(7-Al)de[A2-^ - Ar^] 

Ai 



+ 24 log 



Ai — log [cosh(;^)] X 



o/5 ,^ I ^ 



A2 [^tanh(^) - f ] ' " - iVlog[cosh(^)] 

/3 (1-^) ■ 



2iVlog[cosh(^)] 



1 + 



A2 [^tanh{p - ^ 



A2[ftanh(^, 



Therefore, if the empirical dimension $d_e$ stays bounded when $N$ increases, we
are going to obtain a negative upper bound for any values of the constants
$\lambda_1 > \lambda_2 > \beta$, as soon as $\gamma$ and $\frac{N}{\gamma}$ are chosen to be large enough. This
ability to obtain negative values for the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$, and more
generally $B(\rho, \beta, \gamma)$, leads the way to introducing the new concept of the
effective temperature of an estimator.

Definition 1.1. For any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$ we define
the effective temperature $T(\rho) \in \mathbb{R} \cup \{-\infty, +\infty\}$ of $\rho$ by the equation

$$\rho(R) = \pi_{\exp(-\frac{R}{T(\rho)})}(R).$$

Note that $\beta \mapsto \pi_{\exp(-\beta R)}(R) : \mathbb{R} \cup \{-\infty, +\infty\} \to (0,1)$ is continuous and
strictly decreasing from $\operatorname{ess\,sup}_\pi R$ to $\operatorname{ess\,inf}_\pi R$ (as soon as these two bounds
do not coincide). This shows that the effective temperature $T(\rho)$ is a well
defined random variable.
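To make the definition concrete, here is a minimal sketch computing the inverse effective temperature $1/T(\rho)$ on a finite parameter set, by bisection on the decreasing map $\beta \mapsto \pi_{\exp(-\beta R)}(R)$. The finite setting, the known risk vector and the names `gibbs_weights`, `gibbs_risk` and `inverse_effective_temperature` are assumptions made for illustration; in the statistical setting of the text $R$ is not observed, which is precisely why $T(\rho)$ has to be bounded empirically.

```python
import numpy as np


def gibbs_weights(prior: np.ndarray, risk: np.ndarray, beta: float) -> np.ndarray:
    """Weights of pi_{exp(-beta R)} on a finite parameter set (prior must be positive)."""
    logw = np.log(prior) - beta * risk
    logw -= logw.max()  # stabilise the exponentials
    w = np.exp(logw)
    return w / w.sum()


def gibbs_risk(prior: np.ndarray, risk: np.ndarray, beta: float) -> float:
    """Expected risk pi_{exp(-beta R)}(R)."""
    return float(gibbs_weights(prior, risk, beta) @ risk)


def inverse_effective_temperature(prior, risk, rho, lo=-1e3, hi=1e3, tol=1e-10):
    """Bisection for beta solving pi_{exp(-beta R)}(R) = rho(R); then T(rho) = 1 / beta."""
    target = float(rho @ risk)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gibbs_risk(prior, risk, mid) > target:
            lo = mid  # Gibbs risk still too large: increase beta
        else:
            hi = mid
    return 0.5 * (lo + hi)


prior = np.full(5, 0.2)
risk = np.array([0.10, 0.20, 0.35, 0.50, 0.80])
rho = gibbs_weights(prior, risk, 5.0)  # a posterior whose inverse temperature is known
beta_hat = inverse_effective_temperature(prior, risk, rho)
```

In this toy example the posterior is itself a Gibbs distribution at inverse temperature 5, so the bisection should recover a value close to 5. The search interval `[lo, hi]` is an arbitrary choice; values of $\rho(R)$ outside the range reached on it are simply clipped to its end points.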

Theorem 1.39 provides a bound for $T(\rho)$, indeed:

Proposition 1.40. Let

$$\hat{\beta}(\rho) = \sup\Bigl\{ \beta \in \mathbb{R};\ \inf_{\gamma,\, N \tanh(\frac{\gamma}{N}) > \beta} B(\rho, \beta, \gamma) \le 0 \Bigr\},$$

where $B(\rho, \beta, \gamma)$ is as in Theorem 1.39. Then with $\mathbb{P}$ probability at least
$1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$, $T(\rho) \le \hat{\beta}(\rho)^{-1}$, or
equivalently $\rho(R) \le \pi_{\exp[-\hat{\beta}(\rho) R]}(R)$.

This notion of effective temperature of a (randomized) estimator $\rho$ is interesting
for two reasons:

• the difference $\rho(R) - \pi_{\exp(-\beta R)}(R)$ can be estimated with a better accuracy
than $\rho(R)$ itself, due to the use of relative deviation inequalities, leading
to convergence rates up to $1/N$ in favourable situations, even when $\inf_\Theta R$
is not close to zero;

• and of course $\pi_{\exp(-\beta R)}(R)$ is a decreasing function of $\beta$, thus being able
to estimate $\rho(R) - \pi_{\exp(-\beta R)}(R)$ with some given accuracy means being able
to discriminate between values of $\rho(R)$ with the same accuracy, although
doing so through the parametrization $\beta \mapsto \pi_{\exp(-\beta R)}(R)$, which cannot be
observed nor estimated with the same precision!

1.5.3. Analysis of an empirical bound for the effective temperature. We are
now going to launch into a mathematically rigorous analysis of the bound
$B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ provided by Theorem 1.39, to show that
$\inf_{\rho \in \mathcal{M}^1_+(\Theta)} \pi_{\exp[-\hat{\beta}(\rho) R]}(R)$ converges indeed to $\inf_\Theta R$ at some unimprovable
rates in favourable situations.

It is more convenient for this purpose to use deviation inequalities involving
$M'$ rather than $m'$. It is straightforward to extend Theorem 1.25 to

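Theorem 1.41. For any real constants $\beta$ and $\gamma$, for any prior distribution
$\mu \in \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\gamma \rho \otimes \pi_{\exp(-\beta R)} \bigl[\Psi_{\frac{\gamma}{N}}(R', M')\bigr] \le \gamma \rho \otimes \pi_{\exp(-\beta R)}(r') + \mathcal{K}(\rho, \mu) - \log(\eta).$$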

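In order to transform the left-hand side into a linear expression and at the
same time to localize this theorem, let us choose $\mu$ defined by its density

$$\frac{d\mu}{d\pi}(\theta_1) = C^{-1} \exp\biggl[ -\beta R(\theta_1) - \gamma \int_\Theta \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1, \theta_2), M'(\theta_1, \theta_2)\bigr] - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr],$$

where $C$ is such that $\mu(\Theta) = 1$. We get

$$\begin{aligned}
\mathcal{K}(\rho, \mu) ={} & \beta \rho(R) + \gamma \rho \otimes \pi_{\exp(-\beta R)} \Bigl[ \Psi_{\frac{\gamma}{N}}(R', M') - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R' \Bigr] + \mathcal{K}(\rho, \pi) \\
& + \log\biggl\{ \int_\Theta \exp\biggl[ -\beta R(\theta_1) - \gamma \int_\Theta \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1, \theta_2), M'(\theta_1, \theta_2)\bigr] - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi(d\theta_1) \biggr\} \\
={} & \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \gamma \rho \otimes \pi_{\exp(-\beta R)} \Bigl[ \Psi_{\frac{\gamma}{N}}(R', M') - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R' \Bigr] \\
& + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr) \\
& + \log\biggl\{ \int_\Theta \exp\biggl[ -\gamma \int_\Theta \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1, \theta_2), M'(\theta_1, \theta_2)\bigr] - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi_{\exp(-\beta R)}(d\theta_1) \biggr\}.
\end{aligned}$$

Thus with $\mathbb{P}$ probability at least $1 - \eta$,

$$\bigl[N \sinh\bigl(\tfrac{\gamma}{N}\bigr) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr) - \log(\eta) + C(\beta, \gamma),$$

where

$$C(\beta, \gamma) = \log\biggl\{ \int_\Theta \exp\biggl[ -\gamma \int_\Theta \Bigl\{ \Psi_{\frac{\gamma}{N}}\bigl[R'(\theta_1, \theta_2), M'(\theta_1, \theta_2)\bigr] - \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) R'(\theta_1, \theta_2) \Bigr\} \pi_{\exp(-\beta R)}(d\theta_2) \biggr] \pi_{\exp(-\beta R)}(d\theta_1) \biggr\}. \tag{1.28}$$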

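Remarking that

$$\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] = \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr),$$

we deduce from the previous inequality

Theorem 1.42. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least
$1 - \eta$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$N \sinh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] - \log(\eta) + C(\beta, \gamma).$$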

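We can also go in a slightly different direction, starting again from
equation (1.28) on page 67 and remarking that for any real constant $\lambda$,

$$\lambda \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr) \le \lambda \rho(r) + \mathcal{K}(\rho, \pi) + \log\bigl\{ \pi\bigl[\exp(-\lambda r)\bigr] \bigr\} = \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr].$$

This leads to

Theorem 1.43. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least
$1 - \eta$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$, for any real constant $\lambda$,

$$\bigl[N \sinh\bigl(\tfrac{\gamma}{N}\bigr) - \beta\bigr] \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] \le (\gamma - \lambda) \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] - \log(\eta) + C(\beta, \gamma),$$

where the definition of $C(\beta, \gamma)$ is given by equation (1.28) on page 67.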

We can now use this inequality in the case when $\rho = \pi_{\exp(-\lambda r)}$ and combine
it with inequality (1.27) to obtain

Theorem 1.44. For any real constants $\beta$ and $\gamma$, with $\mathbb{P}$ probability at least
$1 - \eta$, for any real constant $\lambda$,

$$\Bigl[\frac{N\lambda}{\beta} \sinh\bigl(\tfrac{\gamma}{N}\bigr) - \gamma\Bigr] \bigl[\pi_{\exp(-\lambda r)}(r) - \pi_{\exp(-\beta R)}(r)\bigr] \le C(\beta, \gamma) - \log(\eta).$$

We deduce from this theorem

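Proposition 1.45. For any positive real constants $\beta_1$, $\beta_2$ and $\gamma$, with $\mathbb{P}$
probability at least $1 - \eta$, for any real constants $\lambda_1$ and $\lambda_2$, such that
$\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$ and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,

$$\pi_{\exp(-\lambda_1 r)}(r) - \pi_{\exp(-\lambda_2 r)}(r) \le \pi_{\exp(-\beta_1 R)}(r) - \pi_{\exp(-\beta_2 R)}(r) + \frac{C(\beta_1, \gamma) + \log(2/\eta)}{\frac{N\lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma} + \frac{C(\beta_2, \gamma) + \log(2/\eta)}{\gamma - \frac{N\lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.$$

Moreover, $\pi_{\exp(-\beta_1 R)}$ and $\pi_{\exp(-\beta_2 R)}$ being prior distributions, with $\mathbb{P}$ probability
at least $1 - \eta$,

$$\gamma \bigl[\pi_{\exp(-\beta_1 R)}(r) - \pi_{\exp(-\beta_2 R)}(r)\bigr] \le \gamma\, \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)} \bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr] - \log(\eta).$$

Hence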

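Proposition 1.46. For any positive real constants $\beta_1$, $\beta_2$ and $\gamma$, with $\mathbb{P}$
probability at least $1 - \eta$, for any positive real constants $\lambda_1$ and $\lambda_2$ such that
$\lambda_2 < \beta_2 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$ and $\lambda_1 > \beta_1 \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1}$,

$$\begin{aligned}
\pi_{\exp(-\lambda_1 r)}(r) - \pi_{\exp(-\lambda_2 r)}(r)
\le{} & \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)} \bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr] \\
& + \frac{\log(3/\eta)}{\gamma} + \frac{C(\beta_1, \gamma) + \log(3/\eta)}{\frac{N\lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma} + \frac{C(\beta_2, \gamma) + \log(3/\eta)}{\gamma - \frac{N\lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})}.
\end{aligned}$$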

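In order to complete the analysis of the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ given by
Theorem 1.39, there remains now to bound quantities of the general form

$$\log\Bigl\{ \pi_{\exp(-\lambda r)} \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\}
= \sup_{\rho \in \mathcal{M}^1_+(\Theta)} N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr]\, \rho \otimes \pi_{\exp(-\lambda r)}(m') - \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr].$$

Let us consider the prior distribution $\mu \in \mathcal{M}^1_+(\Theta \times \Theta)$ on couples of
parameters defined by its density

$$\frac{d\mu}{d(\pi \otimes \pi)}(\theta_1, \theta_2) = C^{-1} \exp\bigl\{ -\beta R(\theta_1) - \beta R(\theta_2) + \alpha \Phi_{-\frac{\alpha}{N}}\bigl[M'(\theta_1, \theta_2)\bigr] \bigr\},$$

where the normalizing constant $C$ is such that $\mu(\Theta \times \Theta) = 1$. Since for
fixed values of the parameters $\theta$ and $\theta' \in \Theta$, $m'(\theta, \theta')$, like $r(\theta)$, is a sum of
independent Bernoulli random variables, we can easily adapt the proof of
Theorem 1.4 on page 11 to establish that with $\mathbb{P}$ probability at least $1 - \eta$,
for any posterior distribution $\rho$ and any real constant $\lambda$,

$$\begin{aligned}
\alpha \rho \otimes \pi_{\exp(-\lambda r)}(m') \le{} & \alpha \rho \otimes \pi_{\exp(-\lambda r)} \bigl[\Phi_{-\frac{\alpha}{N}}(M')\bigr] + \mathcal{K}\bigl(\rho \otimes \pi_{\exp(-\lambda r)}, \mu\bigr) - \log(\eta) \\
={} & \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] + \mathcal{K}\bigl[\pi_{\exp(-\lambda r)}, \pi_{\exp(-\beta R)}\bigr] \\
& + \log\Bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)} \bigl[\exp\bigl(\alpha \Phi_{-\frac{\alpha}{N}} \circ M'\bigr)\bigr] \Bigr\} - \log(\eta).
\end{aligned}$$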

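Thus for any real constant $\beta$ and any positive real constants $\alpha$ and $\gamma$, with
$\mathbb{P}$ probability at least $1 - \eta$, for any real constant $\lambda$,

$$\begin{aligned}
\log\Bigl\{ \pi_{\exp(-\lambda r)} & \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
\le{} & \sup_{\rho \in \mathcal{M}^1_+(\Theta)} \biggl( \frac{N}{\alpha} \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \Bigl\{ \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] + \mathcal{K}\bigl[\pi_{\exp(-\lambda r)}, \pi_{\exp(-\beta R)}\bigr] \\
& + \log\bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)} \bigl[\exp\bigl(\alpha \Phi_{-\frac{\alpha}{N}} \circ M'\bigr)\bigr] \bigr\} - \log(\eta) \Bigr\} - \mathcal{K}\bigl[\rho, \pi_{\exp(-\lambda r)}\bigr] \biggr).
\end{aligned} \tag{1.29}$$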



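To conclude, we need some suitable upper bound for the entropy
$\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr]$. This question can be handled in the following way: using
Theorem 1.42 on page 68, we see that for any positive real constants $\gamma$ and
$\beta$, with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution $\rho$,

$$\begin{aligned}
\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr]
={} & \beta \bigl[\rho(R) - \pi_{\exp(-\beta R)}(R)\bigr] + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr) \\
\le{} & \frac{\beta}{N \sinh(\frac{\gamma}{N})} \Bigl\{ \gamma \bigl[\rho(r) - \pi_{\exp(-\beta R)}(r)\bigr] + \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] + C(\beta, \gamma) - \log(\eta) \Bigr\} \\
& + \mathcal{K}(\rho, \pi) - \mathcal{K}\bigl(\pi_{\exp(-\beta R)}, \pi\bigr) \\
\le{} & \mathcal{K}\Bigl[\rho, \pi_{\exp\bigl(-\frac{\beta\gamma}{N \sinh(\gamma/N)} r\bigr)}\Bigr] + \frac{\beta}{N \sinh(\frac{\gamma}{N})} \Bigl\{ \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] + C(\beta, \gamma) - \log(\eta) \Bigr\}.
\end{aligned}$$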



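In other words,

Theorem 1.47. For any positive real constants $\beta$ and $\gamma$ such that
$\beta < N \sinh(\frac{\gamma}{N})$, with $\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] \le \frac{\mathcal{K}\Bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \sinh(\frac{\gamma}{N})^{-1} r]}\Bigr]}{1 - \frac{\beta}{N \sinh(\gamma/N)}} + \frac{C(\beta, \gamma) - \log(\eta)}{\frac{N \sinh(\gamma/N)}{\beta} - 1}.$$

Choosing in equation (1.29) on page 70 $\alpha = \dfrac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\beta}{N \sinh(\gamma/N)}}$ and
$\beta = \lambda \frac{N}{\gamma} \sinh(\frac{\gamma}{N})$, so that $\alpha = \dfrac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}$, we obtain with $\mathbb{P}$ probability
at least $1 - \eta$,

$$\begin{aligned}
\log\Bigl\{ \pi_{\exp(-\lambda r)} & \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
\le{} & \frac{2\lambda}{\gamma} \bigl[C(\beta, \gamma) + \log\bigl(\tfrac{2}{\eta}\bigr)\bigr] + \Bigl(1 - \frac{\lambda}{\gamma}\Bigr) \Bigl[ \log\Bigl\{ \pi_{\exp(-\beta R)} \otimes \pi_{\exp(-\beta R)} \bigl[\exp\bigl(\alpha \Phi_{-\frac{\alpha}{N}} \circ M'\bigr)\bigr] \Bigr\} + \log\bigl(\tfrac{2}{\eta}\bigr) \Bigr].
\end{aligned}$$

This proves

Proposition 1.48. For any positive real constants $\lambda < \gamma$, with $\mathbb{P}$ probability
at least $1 - \eta$,

$$\begin{aligned}
\log\Bigl\{ \pi_{\exp(-\lambda r)} & \Bigl[ \exp\bigl\{ N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \pi_{\exp(-\lambda r)}(m') \bigr\} \Bigr] \Bigr\} \\
\le{} & \frac{2\lambda}{\gamma} \Bigl[ C\bigl(\tfrac{N\lambda}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \gamma\bigr) + \log\bigl(\tfrac{2}{\eta}\bigr) \Bigr] \\
& + \Bigl(1 - \frac{\lambda}{\gamma}\Bigr) \log\biggl\{ \pi^{\otimes 2}_{\exp[-\frac{N\lambda}{\gamma} \sinh(\frac{\gamma}{N}) R]} \biggl[ \exp\biggl( \frac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda}{\gamma}}\, \Phi_{-\frac{\log[\cosh(\gamma/N)]}{1 - \lambda/\gamma}} \circ M' \biggr) \biggr] \biggr\} \\
& + \Bigl(1 - \frac{\lambda}{\gamma}\Bigr) \log\bigl(\tfrac{2}{\eta}\bigr).
\end{aligned}$$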

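We are now ready to analyse the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ of Theorem 1.39.

Theorem 1.49. For any positive real constants $\lambda_1$, $\lambda_2$, $\beta_1$, $\beta_2$, $\beta$ and $\gamma$,
such that

$$\lambda_1 < \gamma, \qquad \beta_1 < \frac{N\lambda_1}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \qquad \lambda_2 < \gamma, \qquad \beta_2 > \frac{N\lambda_2}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \qquad \beta < \frac{N\lambda_2}{\gamma} \tanh\bigl(\tfrac{\gamma}{N}\bigr),$$

with $\mathbb{P}$ probability at least $1 - \eta$, the bound $B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)$ of Theorem 1.39
satisfies

$$\begin{aligned}
B(\pi_{\exp(-\lambda_1 r)}, \beta, \gamma)
\le{} & (\gamma - \lambda_1) \biggl\{ \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)} \bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr] + \frac{\log(7/\eta)}{\gamma} \\
& \qquad + \frac{C(\beta_1, \gamma) + \log(7/\eta)}{\frac{N\lambda_1}{\beta_1} \sinh(\frac{\gamma}{N}) - \gamma} + \frac{C(\beta_2, \gamma) + \log(7/\eta)}{\gamma - \frac{N\lambda_2}{\beta_2} \sinh(\frac{\gamma}{N})} \biggr\} \\
& + \frac{2\lambda_1}{\gamma} \Bigl[ C\bigl(\tfrac{N\lambda_1}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \gamma\bigr) + \log\bigl(\tfrac{7}{\eta}\bigr) \Bigr] \\
& + \Bigl(1 - \frac{\lambda_1}{\gamma}\Bigr) \log\biggl\{ \pi^{\otimes 2}_{\exp[-\frac{N\lambda_1}{\gamma} \sinh(\frac{\gamma}{N}) R]} \biggl[ \exp\biggl( \frac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_1}{\gamma}}\, \Phi_{-\frac{\log[\cosh(\gamma/N)]}{1 - \lambda_1/\gamma}} \circ M' \biggr) \biggr] \biggr\} \\
& + \Bigl(1 - \frac{\lambda_1}{\gamma}\Bigr) \log\bigl(\tfrac{7}{\eta}\bigr) - \log\bigl[\nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr] \\
& + \frac{1 - \frac{\lambda_1}{\gamma}}{\frac{N\lambda_2}{\beta\gamma} \tanh(\frac{\gamma}{N}) - 1} \Biggl\{ \frac{2\lambda_2}{\gamma} \Bigl[ C\bigl(\tfrac{N\lambda_2}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \gamma\bigr) + \log\bigl(\tfrac{7}{\eta}\bigr) \Bigr] \\
& \qquad + \Bigl(1 - \frac{\lambda_2}{\gamma}\Bigr) \log\biggl\{ \pi^{\otimes 2}_{\exp[-\frac{N\lambda_2}{\gamma} \sinh(\frac{\gamma}{N}) R]} \biggl[ \exp\biggl( \frac{N \log[\cosh(\frac{\gamma}{N})]}{1 - \frac{\lambda_2}{\gamma}}\, \Phi_{-\frac{\log[\cosh(\gamma/N)]}{1 - \lambda_2/\gamma}} \circ M' \biggr) \biggr] \biggr\} \\
& \qquad + \Bigl(1 - \frac{\lambda_2}{\gamma}\Bigr) \log\bigl(\tfrac{7}{\eta}\bigr) - \log\bigl[\nu(\{\beta\}) \nu(\{\gamma\}) \epsilon\bigr] \Biggr\},
\end{aligned}$$

where the function $C(\beta, \gamma)$ is defined by equation (1.28). (The constant $7$
comes from a union bound over the three events of Proposition 1.46 and the two events
of Proposition 1.48 applied once to $\lambda_1$ and once to $\lambda_2$.)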



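1.5.4. Adaptation to parametric and margin assumptions. To help understand
the previous theorem, it may be useful to give linear upper bounds
to the factors appearing in the right-hand side of the previous inequality.
Introducing $\tilde{\theta}$ such that $R(\tilde{\theta}) = \inf_\Theta R$ (assuming that such a parameter
exists) and remembering that

$$\begin{aligned}
\Psi_{-a}(p, m) &\le a^{-1} \sinh(a)\, p + 2 a^{-1} \sinh\bigl(\tfrac{a}{2}\bigr)^2 m, & a &\in \mathbb{R}_+, \\
\Phi_{-a}(p) &\le a^{-1} \bigl[\exp(a) - 1\bigr] p, & a &\in \mathbb{R}_+, \\
\Psi_{a}(p, m) &\ge a^{-1} \sinh(a)\, p - 2 a^{-1} \sinh\bigl(\tfrac{a}{2}\bigr)^2 m, & a &\in \mathbb{R}_+, \\
M'(\theta_1, \theta_2) &\le M'(\theta_1, \tilde{\theta}) + M'(\theta_2, \tilde{\theta}), & \theta_1, \theta_2 &\in \Theta, \\
M'(\theta_1, \tilde{\theta}) &\le x R'(\theta_1, \tilde{\theta}) + \varphi(x), & x &\in \mathbb{R}_+,\ \theta_1 \in \Theta,
\end{aligned}$$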

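(the last inequality being rather a consequence of the definition of $\varphi$ than a
property of $M'$), we easily see that

$$\begin{aligned}
\pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)} \bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr]
\le{} & \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[\pi_{\exp(-\beta_1 R)}(R) - \pi_{\exp(-\beta_2 R)}(R)\bigr] \\
& + \frac{2N}{\gamma} \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)}(M') \\
\le{} & \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) \bigl[\pi_{\exp(-\beta_1 R)}(R) - \pi_{\exp(-\beta_2 R)}(R)\bigr] \\
& + \frac{2xN}{\gamma} \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \Bigl\{ \pi_{\exp(-\beta_1 R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + \pi_{\exp(-\beta_2 R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] \Bigr\} \\
& + \frac{4N}{\gamma} \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \varphi(x),
\end{aligned}$$

$$\begin{aligned}
C(\beta, \gamma)
\le{} & \log\Bigl\{ \pi_{\exp(-\beta R)} \Bigl[ \exp\bigl\{ 2N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp(-\beta R)}(M') \bigr\} \Bigr] \Bigr\} \\
\le{} & \log\Bigl\{ \pi_{\exp(-\beta R)} \Bigl[ \exp\bigl\{ 2N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 M'(\cdot, \tilde{\theta}) \bigr\} \Bigr] \Bigr\} + 2N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp(-\beta R)}\bigl[M'(\cdot, \tilde{\theta})\bigr] \\
\le{} & \log\Bigl\{ \pi_{\exp(-\beta R)} \Bigl[ \exp\bigl\{ 2xN \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 R'(\cdot, \tilde{\theta}) \bigr\} \Bigr] \Bigr\} \\
& + 2xN \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \varphi(x) \\
\le{} & 2xN \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp[-(\beta - 2xN \sinh(\frac{\gamma}{2N})^2) R]}\bigl[R'(\cdot, \tilde{\theta})\bigr] \\
& + 2xN \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \varphi(x) \\
\le{} & 4xN \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \pi_{\exp[-(\beta - 2xN \sinh(\frac{\gamma}{2N})^2) R]}\bigl[R'(\cdot, \tilde{\theta})\bigr] + 4N \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \varphi(x),
\end{aligned}$$

$$\begin{aligned}
\log\Bigl\{ \pi^{\otimes 2}_{\exp(-\beta R)} \bigl[\exp\bigl(N \alpha \Phi_{-\alpha} \circ M'\bigr)\bigr] \Bigr\}
\le{} & 2 \log\Bigl\{ \pi_{\exp(-\beta R)} \Bigl[ \exp\bigl( N \bigl[\exp(\alpha) - 1\bigr] M'(\cdot, \tilde{\theta}) \bigr) \Bigr] \Bigr\} \\
\le{} & 2xN \bigl[\exp(\alpha) - 1\bigr] \pi_{\exp[-(\beta - xN[\exp(\alpha) - 1]) R]}\bigl[R'(\cdot, \tilde{\theta})\bigr] \\
& + 2N \bigl[\exp(\alpha) - 1\bigr] \varphi(x).
\end{aligned}$$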

Let us push further the investigation under the parametric assumption
that for some positive real constant $d$

$$\lim_{\beta \to +\infty} \beta\, \pi_{\exp(-\beta R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] = d. \tag{1.30}$$

This assumption will for instance hold true with $d = \frac{n}{2}$ when $R : \Theta \to (0, 1)$
is a smooth function defined on a compact subset $\Theta$ of $\mathbb{R}^n$ that reaches its
minimum value on a finite number of non degenerate (i.e. with a positive
definite Hessian) interior points of $\Theta$, and $\pi$ is absolutely continuous with
respect to the Lebesgue measure on $\Theta$ and has a smooth density.
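The value $d = n/2$ can be observed numerically in a one dimensional toy case. The sketch below assumes, purely for illustration, $\Theta = [-1, 1]$, a uniform prior and the risk $R(\theta) = 0.1 + \theta^2/2$ (so that $n = 1$ and the minimum is non degenerate); the function name `scaled_excess_risk` is hypothetical.

```python
import numpy as np


def scaled_excess_risk(beta: float, num_points: int = 200_001) -> float:
    """beta * (pi_{exp(-beta R)}(R) - inf R) for R(theta) = 0.1 + theta^2 / 2 on [-1, 1].

    The prior is uniform on [-1, 1]; the integrals are approximated by
    Riemann sums on a regular grid.
    """
    theta = np.linspace(-1.0, 1.0, num_points)
    excess = 0.5 * theta**2           # R(theta) - inf R
    w = np.exp(-beta * excess)        # unnormalized Gibbs density w.r.t. the prior
    return float(beta * (w @ excess) / w.sum())


# The scaled excess risk should approach d = n / 2 = 0.5 as beta grows.
values = [scaled_excess_risk(beta) for beta in (10.0, 100.0, 1000.0)]
```

For large $\beta$ the Gibbs distribution is close to a centered Gaussian with variance $1/\beta$, so that $\beta\,\pi_{\exp(-\beta R)}(R - \inf R) \approx \beta \cdot \frac{1}{2\beta} = \frac{1}{2}$, in agreement with assumption (1.30).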

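In case of assumption (1.30), if we restrict to sufficiently large values of
the constants $\beta$, $\beta_1$, $\beta_2$, $\lambda_1$, $\lambda_2$ and $\gamma$ (the smallest of which being as a rule $\beta$,
as we will see), we can use the fact that for some (small) positive constant
$\delta$, and some (large) positive constant $A$,

$$\frac{d}{\alpha}(1 - \delta) \le \pi_{\exp(-\alpha R)}\bigl[R'(\cdot, \tilde{\theta})\bigr] \le \frac{d}{\alpha}(1 + \delta), \qquad \alpha \ge A. \tag{1.31}$$

Under this assumption,

$$\begin{aligned}
\pi_{\exp(-\beta_1 R)} \otimes \pi_{\exp(-\beta_2 R)} \bigl[\Psi_{-\frac{\gamma}{N}}(R', M')\bigr]
\le{} & \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr) \Bigl[ \frac{d}{\beta_1}(1 + \delta) - \frac{d}{\beta_2}(1 - \delta) \Bigr] \\
& + \frac{2xN}{\gamma} \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 (1 + \delta) \Bigl[ \frac{d}{\beta_1} + \frac{d}{\beta_2} \Bigr] + \frac{4N}{\gamma} \sinh\bigl(\tfrac{\gamma}{2N}\bigr)^2 \varphi(x).
\end{aligned}$$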
C(/?,7)<^(l + ^)log( ,_,,^l,(^). ) 

+ 2xiVsinh(2^)2 + 4iVsinh(2^)V(x). 

< 2xiV[exp(a) - l] ^ _ J^l^^) - D + '""f^"^^"^ " 



Thus with P probability at least 1 — rj 



5(vrexp(-Air), A7) < -(7 - Ai)f sinh(^)A(i _ 5) 



+ (7-Ai){fsinh(^)(i±f^ 



+ ^ sinh(^)2(l + 5) + ^] + f sinh(^) V(x) +
4xiVsinh(^) %^_J^+S; . + 4iVsinh(^)V(x) + log(J)



+ 



+ 



V 'IN I •_ 

4xiVsinh(^) %^_ J^+f);^^). + 4iVsinh(^)V(x) + log(I) • 
- ^sinh(^) 



+ 



.... 

^|4xiVsinh(2^)2,,,,,^ ^ 

7 I ^sinh(^)-2xArsinh(^)2 



+ 4iVsinh(2^)V(x) + log(J) 



+ (l-^)(2.(l + .)| 



a;7 



Ai sinh(-^) 

^^Pl i-Ai 

7 



-1 



+ 2iV exp 



/log[cosh(^\ ■ 



1_ Ai 

7 



^^tanh(^)-l\ 7 



7 

+ (l - ^) log(J) - logH{/?}M{7}K 



^|4xiVsinh(2l^)2^^^- 



7 

+ 4iV sin] 



2d(l + S) I 



a;7 



exp 



A2 sinh(-^) 

+ 2iV exp 



ih(2^)V(a;)+log(p} 



1 



/log[cosh(^\ _ ■ 

(1 - log(: 



I)-log[i/(/3)i.(7)e] 



Now let us choose for simplicity $\beta_2 = 2\lambda_2 = 4\beta$, $\beta_1 = \lambda_1/2 = \gamma/4$, and let
us introduce the notations

$$C_1 = \frac{N}{\gamma} \sinh\bigl(\tfrac{\gamma}{N}\bigr), \qquad C_2 = \frac{N}{\gamma} \tanh\bigl(\tfrac{\gamma}{N}\bigr),$$

iV2 . ^2

<^3 = :^[exp(^)-lJ 

2iV^(l - f ) p . 
and O4 = 5 — ^ — expl 



r 



to obtain 



Ci7 

-B(7rexp(-Air),/3,7) < g^'^^"'^)^ 

+ ^{ ^ + x^d + ^) [f + ^] + ^^(x)} + i log(I) 



+ 



2Ci - 1 

1 

+ 



2-Ci 



2x7(1 + d)d 



+ 



+ C7iXx) + log(I) 



A'' — X7 



+ 



N V2C3 7 'AT' 



. I 



+ 



(l-^)|2d(l + ^) 



X7 



AT [ 7C4 



7 



.X7 



7^ 



A^(l-f 



+ 1 



2^ 

7 



log(J)-log[z/(/3)i/(7)e 



This simplifies to 



Ci 7 

^(7rexp(-Air-),/3,7) < g-(l - ^)d-^ 



+ 2Ci(l + 5)d + log(^) 



^ + (4Ci-2)(2-Ci) + 4C„ _ 2 



1 + 

7
+



(1 + 6)dxj ( 



N 



I '-'1 2Ci-l \2Ci N 



2(l-f 



\2C3 N 



+ 



4Ci/3 
7(4C2-2) 



+ 



(1 + 6)dx-f^ \ c, j_ 2 



N/3 



16 ^ 2-Ci \ Ci N/3 J 



+ 1 



2/3\ 1 



7 / 2C2-1 



4Ci 

C4 



2/3 '\ _ ofx 



2 (' 

+ ^^(^){^ + 4fe+2^ + ^3 + 



4/3 



7(4C2-2) + 4C2-2 



C4 



This shows that there exist universal positive real constants Ai, A2, -Bi, 
B2, B3, and B4 such that as soon as < < ^2, 



^(vTexpC-Air), /?, 7) < -Bl{l " <5)d^ + ^2(1 + 6)d 



Thus $\pi_{\exp(-\lambda_1 r)}(R) \le \pi_{\exp(-\beta R)}(R) \le \inf_\Theta R + \frac{(1+\delta) d}{\beta}$ as soon as moreover



< 



5i 



R (1+^) I g4^y(x)-B3 log[i^(/3)t^(7)^>?] 



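Choosing some real ratio $\alpha > 1$, we can now make the above result uniform
for any

$$\beta, \gamma \in \Lambda_\alpha = \Bigl\{ \alpha^k;\ k \in \mathbb{N},\ 0 \le k < \frac{\log(N)}{\log(\alpha)} \Bigr\} \tag{1.32}$$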
by substituting i^(/3) and 1^(7) with ^^"^^'^^ and — log(?7) with — log(?7) + 

[ log(a) J- 

Taking moreover for simplicity $\eta = \epsilon$, let us summarize the type of result
we got by

Theorem 1.50. There exist positive real universal constants $A$, $B_1$, $B_2$,
$B_3$ and $B_4$ such that for any positive real constants $\alpha > 1$, $d$ and $\delta$, for
any prior distribution $\pi \in \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
$\beta, \gamma \in \Lambda_\alpha$ (where $\Lambda_\alpha$ is defined by equation (1.32) above) such that

13' 

sup — [7rexp(-/3'R) {R) - inf R] - 

/3'GI[l,/3'>/3 " 



< 5 



May 28, 2006 



Olivier Catoni 



78 



1 Inductive PAC-Bayesian learning 



and such that also for some positive real parameter x 

7max|x, 1} A3 B Bi 
< ana — < 



N T 7 „ (1+5) , B4^V'(^)-2B3log(e)+4B3log 

ooTT-r; H 



log(iV) 



N ^ 

'2(1-5) (l-5)d 

the bound $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$ given by Theorem 1.39, in the case
where we have chosen $\nu$ to be the uniform probability measure on $\Lambda_\alpha$, satisfies
$B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0$, proving that $\hat{\beta}(\pi_{\exp(-\frac{\gamma}{2} r)}) \ge \beta$ and therefore that

$$\pi_{\exp(-\frac{\gamma}{2} r)}(R) \le \pi_{\exp(-\beta R)}(R) \le \inf_\Theta R + \frac{(1 + \delta) d}{\beta}.$$

What is important in this result is that we do not only bound $\pi_{\exp(-\frac{\gamma}{2} r)}(R)$
but also $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$, and that we do it uniformly on a grid of values
of $\beta$ and $\gamma$, showing that we can indeed set the constants $\beta$ and $\gamma$ adaptively
using the empirical bound $B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma)$.

Let us see what we get under the margin assumption (1.21) (see page 47).
When $\kappa = 1$, $\varphi(c^{-1}) \le 0$, leading to

Corollary 1.51. Assuming that the margin assumption (1.21) (on page 47)
is satisfied for $\kappa = 1$, that $R : \Theta \to (0, 1)$ is independent of $N$ (which is the
case for instance when $\mathbb{P} = P^{\otimes N}$), and is such that

$$\lim_{\beta' \to +\infty} \beta' \bigl[\pi_{\exp(-\beta' R)}(R) - \inf_\Theta R\bigr] = d,$$

there are universal positive real constants $B_5$ and $B_6$ and $N_1 \in \mathbb{N}$ such that
for any $N \ge N_1$, with $\mathbb{P}$ probability at least $1 - \epsilon$

2 



B d 

7re,p(_^.)(i?) <infi?+-|p 



1 + — log ' 



d °\ e 



where $\hat{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max\bigl\{ \beta \in \Lambda_2;\ B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0 \bigr\}$ (where $\Lambda_2$ is
defined by equation (1.32) on page 77).



1 



When $\kappa > 1$, $\varphi(x) \le (1 - \kappa^{-1}) (\kappa c x)^{-\frac{1}{\kappa - 1}}$, and we can choose $\gamma$ and $x$ such
that $\frac{\gamma^2}{N} \varphi(x) \simeq d$ to prove

Corollary 1.52. Assuming that the margin assumption (1.21) is satisfied
for some exponent $\kappa > 1$, that $R : \Theta \to (0, 1)$ is independent of $N$ (which is
for instance the case when $\mathbb{P} = P^{\otimes N}$), and is such that

$$\lim_{\beta' \to +\infty} \beta' \bigl[\pi_{\exp(-\beta' R)}(R) - \inf_\Theta R\bigr] = d,$$

there are universal positive constants $B_7$ and $B_8$ and $N_1 \in \mathbb{N}$ such that for
any $N \ge N_1$, with $\mathbb{P}$ probability at least $1 - \epsilon$,


7rexp{-9r)(i?) < inf + B^c 



1 

' 2k-1 



B. f\og{N) 



2k 

2k-1 



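where $\hat{\gamma} \in \arg\max_{\gamma \in \Lambda_2} \max\bigl\{ \beta \in \Lambda_2;\ B(\pi_{\exp(-\frac{\gamma}{2} r)}, \beta, \gamma) \le 0 \bigr\}$ ($\Lambda_2$ being
defined by equation (1.32) on page 77).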

We find the same rate of convergence as in Corollarv ll.80l on page l481 but this 
time, we were able to provide an empirical posterior distribution $\pi_{\exp(-\frac{\hat{\gamma}}{2} r)}$
which achieves this rate adaptively in all the parameters (meaning in particular
that we do not need to know $d$, $c$ or $\kappa$). Moreover, as already mentioned,
the power of $N$ in this rate of convergence is known to be unimprovable in
the worst case (see j2Hl EH ES] i and more specifically in |3j — downloadable 
from its author's web page, — Theorem 3.3 on page 132). 

1.5.5. Estimating the divergence of a posterior with respect to a Gibbs prior.
Another interesting question is to estimate $\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr]$ using relative
deviation inequalities. We follow here an idea to be found first in Audibert
[3, page 93]. Indeed, combining equation (1.24) with equation (1.25) on page
61, we see that for any positive real parameters $\beta$ and $\lambda$, with $\mathbb{P}$ probability
at least $1 - \epsilon$, for any posterior distribution $\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,



3<:[p,7r, 



exp(-/3R)_ 



< 



A^Al 2 



log 



1 + A 



1 - A 



[p(r) -7rexp(-/3ij)(r)] 



N 



— log(l - \^)p Vrexp(-/3iJ)(m') 



+ [p, vrexp(-/3i?)] - log(e) I + OC{p, n) - X [vr, 



exp(-/3_R) 



< ^[P.^exp[-Alog(l±A).]] + J^^[P^^oM~m] - ]^ 



+ log 



TT 



Thus, putting 7 = ^ log(Y^), we obtain 



cxp[ 



■log(f±i)r] 



,{ 



exp 



2A 



logfl - A- 



log(e) 
'■)p{m')] } 



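Theorem 1.53. For any positive real constants $\beta$ and $\gamma$ such that
$\beta < N \tanh(\frac{\gamma}{N})$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\begin{aligned}
\mathcal{K}\bigl[\rho, \pi_{\exp(-\beta R)}\bigr] \le \Bigl(1 - \frac{\beta}{N} \tanh\bigl(\tfrac{\gamma}{N}\bigr)^{-1}\Bigr)^{-1} \biggl\{ & \mathcal{K}\Bigl[\rho, \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}\Bigr] \\
& + \log\Bigl\{ \pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]} \Bigl[ \exp\bigl\{ \beta \tanh\bigl(\tfrac{\gamma}{N}\bigr)^{-1} \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} \\
& - \frac{\beta}{N} \tanh\bigl(\tfrac{\gamma}{N}\bigr)^{-1} \log(\epsilon) \biggr\}.
\end{aligned}$$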



This theorem provides another way of measuring overfitting, since it gives
an upper bound for $\mathcal{K}\bigl[\pi_{\exp[-\beta \frac{\gamma}{N} \tanh(\frac{\gamma}{N})^{-1} r]}, \pi_{\exp(-\beta R)}\bigr]$. It may be used in

combination with Theorem II .ll)! on page as an alternative to Theorem 
11.181 on page 1301 It will also be used in the next section. 

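An alternative parametrization of the same result providing a simpler
right-hand side is also useful:

Corollary 1.54. For any positive real constants $\beta$ and $\gamma$ such that $\beta < \gamma$,
with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distribution
$\rho : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\mathcal{K}\Bigl[\rho, \pi_{\exp[-N \frac{\beta}{\gamma} \tanh(\frac{\gamma}{N}) R]}\Bigr] \le \Bigl(1 - \frac{\beta}{\gamma}\Bigr)^{-1} \biggl\{ \mathcal{K}\bigl[\rho, \pi_{\exp(-\beta r)}\bigr] + \log\Bigl\{ \pi_{\exp(-\beta r)} \Bigl[ \exp\bigl\{ N \tfrac{\beta}{\gamma} \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] \rho(m') \bigr\} \Bigr] \Bigr\} - \frac{\beta}{\gamma} \log(\epsilon) \biggr\}.$$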



1.5.6. Comparing two posterior distributions. Estimating the effective temperature
of an estimator provides an efficient way to tune parameters in a
model with a parametric behaviour. On the other hand, it is not well suited
to choosing between different models, especially in the case when they are
nested (because, as we already saw in the case when $\Theta$ is a union of nested
models, the prior distribution $\pi_{\exp(-\beta R)}$ does not provide an efficient localization
of the parameter in this case, in the sense that $\pi_{\exp(-\beta R)}(R)$ does not
go down to $\inf_\Theta R$ at the desired rate when $\beta$ goes to $+\infty$, which requires
resorting to partial localization).

Once some estimator (in the form of a posterior distribution) has been
chosen in each submodel, these estimators can be compared between themselves
with the help of the relative bounds that we will establish in this
section.

From equation (1.23) (slightly modified by replacing $\pi \otimes \pi$ with $\pi^1 \otimes \pi^2$),
we obtain easily

Theorem 1.55. For any positive real constant $\lambda$, for any prior distributions
$\pi^1, \pi^2 \in \mathcal{M}^1_+(\Theta)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distributions $\rho_1$ and $\rho_2 : \Omega \to \mathcal{M}^1_+(\Theta)$,

$$\begin{aligned}
-N \log\Bigl\{ 1 - \tanh\bigl(\tfrac{\lambda}{N}\bigr) \bigl[\rho_2(R) - \rho_1(R)\bigr] \Bigr\} \le{} & \lambda \bigl[\rho_2(r) - \rho_1(r)\bigr] + N \log\bigl[\cosh\bigl(\tfrac{\lambda}{N}\bigr)\bigr]\, \rho_1 \otimes \rho_2(m') \\
& + \mathcal{K}(\rho_1, \pi^1) + \mathcal{K}(\rho_2, \pi^2) - \log(\epsilon).
\end{aligned}$$

Here the entropy bound of the previous section enters into the game,
providing a localized version of Theorem 1.55. We will use the notation

$$\Xi_a(q) = \tanh(a)^{-1} \bigl[1 - \exp(-a q)\bigr] \le \frac{a}{\tanh(a)}\, q, \qquad a, q \in \mathbb{R}.$$



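Theorem 1.56. For any sequence of prior distributions $(\pi^i)_{i \in \mathbb{N}} \in \mathcal{M}^1_+(\Theta)^{\mathbb{N}}$,
any probability distribution $\mu$ on $\mathbb{N}$, any atomic probability distribution $\nu$
on $\mathbb{R}_+$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior distributions
$\rho_1, \rho_2 : \Omega \to \mathcal{M}^1_+(\Theta)$,

$\rho_2(R) - \rho_1(R) \le B(\rho_1, \rho_2)$, where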



B{pi,P2)= inf ^±{[p2{r) - pi{r)] 



+ f log[cosh(A)]pi0p2K) 



+ 



A(l-^ 

71 



X[pu 



' ■^exp(-/3ir)J 



+ log{<xp(-ft.) [exp{Af log[cosh(^)];,i(m')}] } 



+ 



A(l-^ 

72 



X[p2,7r- 



exp(— /32r)J 



+ logi TT-' ^ ^ 



exp{/32f log[cosh(2i)]^2(m')}]} 



ir-i)"' + (I-i)"' + i 



log[3-^u{(3i)u{P2Hji)u{j2HX)p{i^p{j)(
The sequence of prior distributions $(\pi^i)_{i \in \mathbb{N}}$ should be understood to be typically
supported by subsets of $\Theta$ corresponding to parametric submodels,
that is submodels for which it is reasonable to expect that

$$\lim_{\beta \to +\infty} \beta \bigl[\pi^i_{\exp(-\beta R)}(R) - \operatorname{ess\,inf}_{\pi^i} R\bigr]$$

exists and is positive and finite. As there is no reason why the bound
$B(\rho_1, \rho_2)$ provided by the previous theorem should be subadditive (in the
sense that $B(\rho_1, \rho_3) \le B(\rho_1, \rho_2) + B(\rho_2, \rho_3)$), it is adequate, at least from a
theoretical point of view, to consider some workable subset $\mathcal{P} \subset \mathcal{M}^1_+(\Theta)$ of
posterior distributions (for instance the distributions of the form $\pi^i_{\exp(-\beta r)}$,
$i \in \mathbb{N}$, $\beta \in \mathbb{R}_+$; it is understood that $\mathcal{P}$ is allowed to be a random subset of
$\mathcal{M}^1_+(\Theta)$, as in this suggested example), and to define the subadditive chained
bound

$$\widetilde{B}(\rho, \rho') = \inf\Bigl\{ \sum_{k=0}^{n-1} B(\rho_k, \rho_{k+1});\ n \in \mathbb{N}^*,\ (\rho_k)_{k=0}^{n} \in \mathcal{P}^{n+1},\ \rho_0 = \rho,\ \rho_n = \rho' \Bigr\}, \qquad \rho, \rho' \in \mathcal{P}.$$

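Proposition 1.57. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any posterior
distributions $\rho_1, \rho_2 \in \mathcal{P}$, $\rho_2(R) - \rho_1(R) \le \widetilde{B}(\rho_1, \rho_2)$. Moreover for any
posterior distribution $\rho_1 \in \mathcal{P}$, any posterior distribution $\rho_2 \in \mathcal{P}$ such that
$\widetilde{B}(\rho_1, \rho_2) = \inf_{\rho_3 \in \mathcal{P}} \widetilde{B}(\rho_1, \rho_3)$ is unimprovable with the help of $\widetilde{B}$ in $\mathcal{P}$ in the
sense that $\inf_{\rho_3 \in \mathcal{P}} \widetilde{B}(\rho_2, \rho_3) \ge 0$.

Proof. The first assertion is a direct consequence of the previous theorem,
therefore only the second assertion requires a proof: for any $\rho_3 \in \mathcal{P}$, we
deduce from the optimality of $\rho_2$ and the subadditivity of $\widetilde{B}$ that
$\widetilde{B}(\rho_1, \rho_2) \le \widetilde{B}(\rho_1, \rho_3) \le \widetilde{B}(\rho_1, \rho_2) + \widetilde{B}(\rho_2, \rho_3)$. □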
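When the workable set of posterior distributions is finite, the chained bound is a shortest path computation on the matrix of pairwise bounds. The following sketch uses the Floyd-Warshall recursion for this purpose; the finite setting, the function name `chained_bound` and the toy matrix are assumptions made for illustration. The recursion is only meaningful when no cycle has a negative total bound, which holds on the event of Theorem 1.56, since there each $B(\rho_1, \rho_2)$ dominates $\rho_2(R) - \rho_1(R)$ and these differences sum to zero along a cycle.

```python
import numpy as np


def chained_bound(b: np.ndarray) -> np.ndarray:
    """Smallest sum of pairwise bounds along chains with at least one step.

    b[i, j] plays the role of B(rho_i, rho_j); the result d[i, j] is the
    infimum of b[i, k1] + b[k1, k2] + ... + b[kn, j] over all finite chains.
    Assumes that no cycle has a negative total bound.
    """
    d = np.array(b, dtype=float)
    n = d.shape[0]
    for k in range(n):
        # Allow chains passing through the intermediate distribution k.
        d = np.minimum(d, d[:, [k]] + d[[k], :])
    return d


# Toy matrix of pairwise bounds between three posterior distributions.
b = np.array([
    [0.5, -0.2, 0.4],
    [0.3, 0.1, -0.1],
    [0.6, 0.5, 0.2],
])
d = chained_bound(b)
# Going from 0 to 2 through 1 gives -0.2 - 0.1 = -0.3, better than the direct 0.4.
```

In this toy example the direct bound from the first to the third distribution is positive, while the chained bound is negative, so that chaining reveals an improvement that the direct comparison misses.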

This proposition provides a way to improve a posterior distribution $\rho_1 \in \mathcal{P}$
by choosing $\rho_2 \in \arg\min_{\rho \in \mathcal{P}} \widetilde{B}(\rho_1, \rho)$ whenever $\widetilde{B}(\rho_1, \rho_2) < 0$. According to
Proposition 1.57, this improvement process is a one step process:
the improved posterior $\rho_2$ thus obtained cannot be improved again using
the same technique.

Let us give some example of a possible starting distribution $\rho_1$ for this
improvement scheme: $\rho_1$ may be chosen as the best posterior Gibbs distribution
according to Proposition 1.40 on page 66. More precisely, we may build from
the prior distributions $\pi^i$, $i \in \mathbb{N}$, a global prior $\pi = \sum_{i \in \mathbb{N}} \mu(i) \pi^i$,
then define the estimator of the inverse effective temperature as in
Proposition 1.40 and choose $\rho_1 \in \arg\max_{\rho \in \mathcal{P}} \hat{\beta}(\rho)$, where $\mathcal{P}$ is as suggested above
the set of posterior distributions

$$\mathcal{P} = \bigl\{ \pi^i_{\exp(-\beta r)};\ i \in \mathbb{N},\ \beta \in \mathbb{R}_+ \bigr\}.$$

(This starting point $\rho_1$ should already be pretty good, at least in an asymptotic
perspective, the only gain in the rate of convergence to be expected
bearing on spurious $\log(N)$ factors.)

For more elaborate uses of relative bounds, we refer to the third section
of the second chapter of Audibert [3], where an algorithm is proposed and
analyzed, which makes it possible to use relative bounds between two posterior
distributions as a stand alone estimation tool.

1.5.7. Two step localization of relative bounds. Let us consider again in
this section the case when we want to choose adaptively between a family of
parametric models. Let us thus assume that the parameter set is a disjoint
union of measurable submodels, so that we can write $\Theta = \bigsqcup_{m \in M} \Theta_m$, where
$M$ is some measurable index set. Let us choose some prior probability distribution
on the index set $\mu \in \mathcal{M}^1_+(M)$, and some regular conditional prior
distribution on $(M, \Theta)$, $\pi : M \to \mathcal{M}^1_+(\Theta)$, such that $\pi(m, \Theta_m) = 1$, $m \in M$.
Let us then study some arbitrary posterior distributions $\nu : \Omega \to \mathcal{M}^1_+(M)$
and $\rho : \Omega \times M \to \mathcal{M}^1_+(\Theta)$, such that $\rho(\omega, m, \Theta_m) = 1$, $\omega \in \Omega$, $m \in M$. We
would like to compare $\nu\rho(R)$ with some doubly localized prior distribution
$\mu_{\exp[-\frac{\beta}{1 + \zeta_2} \pi_{\exp(-\beta R)}(R)]}\bigl[\pi_{\exp(-\beta R)}\bigr](R)$ (where $\zeta_2$ is a positive parameter to be
set as needed later on). To ease notations, we will define two prior distributions
(one being more precisely a conditional distribution) depending on the
positive real parameters $\beta$ and $\zeta_2$, putting

$$\overline{\pi} = \pi_{\exp(-\beta R)} \quad \text{and} \quad \overline{\mu} = \mu_{\exp[-\frac{\beta}{1 + \zeta_2} \overline{\pi}(R)]}. \tag{1.33}$$

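Similarly to Theorem 1.26 on page 44, we can write for any positive real
constants $\beta$ and $\gamma$

$$\mathbb{P}\Bigl\{ (\overline{\mu}\,\overline{\pi}) \otimes (\overline{\mu}\,\overline{\pi}) \Bigl[ \exp\Bigl( -N \log\bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) R'\bigr] - \gamma r' - N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1,$$

and deduce, using Lemma lOl on page ITT]

$$\begin{aligned}
\mathbb{P}\biggl\{ \exp\biggl[ \sup_{\nu} \sup_{\rho} \Bigl( & -N \log\bigl[1 - \tanh\bigl(\tfrac{\gamma}{N}\bigr) (\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \gamma (\nu\rho - \overline{\mu}\,\overline{\pi})(r) \\
& - N \log\bigl[\cosh\bigl(\tfrac{\gamma}{N}\bigr)\bigr] (\nu\rho) \otimes (\overline{\mu}\,\overline{\pi})(m') - \mathcal{K}(\nu, \overline{\mu}) - \nu\bigl[\mathcal{K}(\rho, \overline{\pi})\bigr] \Bigr) \biggr] \biggr\} \le 1.
\end{aligned} \tag{1.34}$$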



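This will be our starting point in comparing $\nu\rho(R)$ with
$\overline{\mu}\,\overline{\pi}(R)$. However, obtaining an empirical bound will
require some supplementary efforts. For each $m \in M$, we can write in the
same way

$$\mathbb{P}\Bigl\{ \overline{\pi} \otimes \overline{\pi} \Bigl[ \exp\Bigl( -N\log\bigl[1 - \tanh(\tfrac{\gamma}{N}) R'\bigr] - \gamma r' - N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1.$$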



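Integrating this inequality with respect to $\overline{\mu}$ and using
Fubini's lemma for positive functions, we get

$$\mathbb{P}\Bigl\{ \overline{\mu}(\overline{\pi} \otimes \overline{\pi}) \Bigl[ \exp\Bigl( -N\log\bigl[1 - \tanh(\tfrac{\gamma}{N}) R'\bigr] - \gamma r' - N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr] m' \Bigr) \Bigr] \Bigr\} \le 1.$$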



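Let us make clear that $\overline{\mu}(\overline{\pi} \otimes \overline{\pi})$
is a probability measure on $M \times \Theta \times \Theta$, whereas
$(\overline{\mu}\,\overline{\pi}) \otimes (\overline{\mu}\,\overline{\pi})$
considered previously is a probability measure on
$(M \times \Theta) \times (M \times \Theta)$. We get as previously

$$\mathbb{P}\Bigl\{ \exp\Bigl[ \sup_{\nu}\sup_{\rho} \Bigl( -N\log\bigl[1 - \tanh(\tfrac{\gamma}{N})\nu(\rho - \overline{\pi})(R)\bigr] - \gamma\nu(\rho - \overline{\pi})(r) - N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr]\nu(\rho \otimes \overline{\pi})(m') - \mathcal{K}(\nu,\overline{\mu}) - \nu\bigl[\mathcal{K}(\rho,\overline{\pi})\bigr] \Bigr) \Bigr] \Bigr\} \le 1. \qquad (1.35)$$

Let us eventually recall that

$$\mathcal{K}(\nu,\overline{\mu}) = \frac{\beta}{1+\zeta_2}(\nu - \overline{\mu})\overline{\pi}(R) + \mathcal{K}(\nu,\mu) - \mathcal{K}(\overline{\mu},\mu), \qquad (1.36)$$
$$\mathcal{K}(\rho,\overline{\pi}) = \beta(\rho - \overline{\pi})(R) + \mathcal{K}(\rho,\pi) - \mathcal{K}(\overline{\pi},\pi). \qquad (1.37)$$

From equations (1.34), (1.35) and (1.37) we deduce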



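Proposition 1.58. For any positive real constants $\beta$, $\gamma$ and
$\zeta_2$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any
conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$-N\log\bigl[1 - \tanh(\tfrac{\gamma}{N})(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta\nu(\rho - \overline{\pi})(R)
\le \gamma(\nu\rho - \overline{\mu}\,\overline{\pi})(r) + N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr](\nu\rho)\otimes(\overline{\mu}\,\overline{\pi})(m')
+ \mathcal{K}(\nu,\overline{\mu}) + \nu\bigl[\mathcal{K}(\rho,\pi)\bigr] - \nu\bigl[\mathcal{K}(\overline{\pi},\pi)\bigr] + \log(\tfrac{2}{\epsilon}),$$

and

$$-N\log\bigl[1 - \tanh(\tfrac{\gamma}{N})\nu(\rho - \overline{\pi})(R)\bigr]
\le \gamma\nu(\rho - \overline{\pi})(r) + N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr]\nu(\rho \otimes \overline{\pi})(m')
+ \mathcal{K}(\nu,\overline{\mu}) + \nu\bigl[\mathcal{K}(\rho,\overline{\pi})\bigr] + \log(\tfrac{2}{\epsilon}),$$

where the prior distribution $\overline{\mu}\,\overline{\pi}$ is defined by
equation (1.33) on page 81 and depends on $\beta$ and $\zeta_2$.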

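Let us put for short

$$T = \tanh(\tfrac{\gamma}{N}) \quad\text{and}\quad C = N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr].$$

We will use some entropy compensation strategy for which we need a
couple of entropy bounds. Let us assume that $\beta < NT$. We have according
to Proposition 1.58, with $\mathbb{P}$ probability at least $1 - \epsilon$,

$$\nu\bigl[\mathcal{K}(\rho,\overline{\pi})\bigr] = \beta\nu(\rho - \overline{\pi})(R) + \nu\bigl[\mathcal{K}(\rho,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr]$$
$$\le \frac{\beta}{NT}\Bigl[ \gamma\nu(\rho - \overline{\pi})(r) + C\nu(\rho \otimes \overline{\pi})(m') + \mathcal{K}(\nu,\overline{\mu}) + \nu\bigl[\mathcal{K}(\rho,\overline{\pi})\bigr] + \log(\tfrac{2}{\epsilon}) \Bigr] + \nu\bigl[\mathcal{K}(\rho,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr].$$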



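Similarly

$$\mathcal{K}(\nu,\overline{\mu}) = \frac{\beta}{1+\zeta_2}(\nu - \overline{\mu})\overline{\pi}(R) + \mathcal{K}(\nu,\mu) - \mathcal{K}(\overline{\mu},\mu)$$
$$\le \frac{\beta}{(1+\zeta_2)NT}\Bigl[ \gamma(\nu - \overline{\mu})\overline{\pi}(r) + C(\nu\overline{\pi})\otimes(\overline{\mu}\,\overline{\pi})(m') + \mathcal{K}(\nu,\overline{\mu}) + \log(\tfrac{2}{\epsilon}) \Bigr] + \mathcal{K}(\nu,\mu) - \mathcal{K}(\overline{\mu},\mu).$$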



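Thus, for any positive real constants $\beta$, $\gamma$ and $\zeta_i$,
$i = 1, \dots, 5$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for
any posterior distributions $\nu, \nu_3 : \Omega \to \mathcal{M}_+^1(M)$, any
posterior conditional distributions
$\rho, \rho_1, \rho_2, \rho_4, \rho_5 : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$-N\log\bigl[1 - T(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta\nu(\rho - \overline{\pi})(R)
\le \gamma(\nu\rho - \overline{\mu}\,\overline{\pi})(r) + C(\nu\rho)\otimes(\overline{\mu}\,\overline{\pi})(m')
+ \mathcal{K}(\nu,\overline{\mu}) + \nu\bigl[\mathcal{K}(\rho,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr] + \log(\tfrac{2}{\epsilon}),$$

$$\zeta_1\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_1,\overline{\pi})\bigr] \le \zeta_1\gamma\overline{\mu}(\rho_1 - \overline{\pi})(r) + \zeta_1 C\overline{\mu}(\rho_1 \otimes \overline{\pi})(m')
+ \zeta_1\overline{\mu}\bigl[\mathcal{K}(\rho_1,\overline{\pi})\bigr] + \zeta_1\log(\tfrac{2}{\epsilon}) + \zeta_1\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_1,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr],$$

$$\zeta_2\tfrac{NT}{\beta}\nu\bigl[\mathcal{K}(\rho_2,\overline{\pi})\bigr] \le \zeta_2\gamma\nu(\rho_2 - \overline{\pi})(r) + \zeta_2 C\nu(\rho_2 \otimes \overline{\pi})(m')
+ \zeta_2\mathcal{K}(\nu,\overline{\mu}) + \zeta_2\nu\bigl[\mathcal{K}(\rho_2,\overline{\pi})\bigr] + \zeta_2\log(\tfrac{2}{\epsilon})
+ \zeta_2\tfrac{NT}{\beta}\nu\bigl[\mathcal{K}(\rho_2,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr],$$

$$\zeta_3(1+\zeta_2)\tfrac{NT}{\beta}\mathcal{K}(\nu_3,\overline{\mu}) \le \zeta_3\gamma(\nu_3 - \overline{\mu})\overline{\pi}(r)
+ \zeta_3 C\bigl[(\nu_3\overline{\pi})\otimes(\nu_3\rho_1) + (\nu_3\rho_1)\otimes(\overline{\mu}\,\overline{\pi})\bigr](m') + \zeta_3\mathcal{K}(\nu_3,\overline{\mu}) + \zeta_3\log(\tfrac{2}{\epsilon})
+ \zeta_3(1+\zeta_2)\tfrac{NT}{\beta}\bigl[\mathcal{K}(\nu_3,\mu) - \mathcal{K}(\overline{\mu},\mu)\bigr],$$

$$\zeta_4\tfrac{NT}{\beta}\nu_3\bigl[\mathcal{K}(\rho_4,\overline{\pi})\bigr] \le \zeta_4\gamma\nu_3(\rho_4 - \overline{\pi})(r)
+ \zeta_4 C\nu_3(\rho_4 \otimes \overline{\pi})(m') + \zeta_4\mathcal{K}(\nu_3,\overline{\mu}) + \zeta_4\nu_3\bigl[\mathcal{K}(\rho_4,\overline{\pi})\bigr] + \zeta_4\log(\tfrac{2}{\epsilon})
+ \zeta_4\tfrac{NT}{\beta}\nu_3\bigl[\mathcal{K}(\rho_4,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr],$$

$$\zeta_5\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_5,\overline{\pi})\bigr] \le \zeta_5\gamma\overline{\mu}(\rho_5 - \overline{\pi})(r) + \zeta_5 C\overline{\mu}(\rho_5 \otimes \overline{\pi})(m')
+ \zeta_5\overline{\mu}\bigl[\mathcal{K}(\rho_5,\overline{\pi})\bigr] + \zeta_5\log(\tfrac{2}{\epsilon}) + \zeta_5\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_5,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr].$$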

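Adding these six inequalities and assuming that
$\zeta_4 \le \zeta_3\bigl[(1+\zeta_2)\tfrac{NT}{\beta} - 1\bigr]$, we find

$$-N\log\bigl[1 - T(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta(\nu\rho - \overline{\mu}\,\overline{\pi})(R)$$
$$\le -N\log\bigl[1 - T(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta(\nu\rho - \overline{\mu}\,\overline{\pi})(R)
+ \zeta_1\bigl(\tfrac{NT}{\beta} - 1\bigr)\overline{\mu}\bigl[\mathcal{K}(\rho_1,\overline{\pi})\bigr] + \zeta_2\bigl(\tfrac{NT}{\beta} - 1\bigr)\nu\bigl[\mathcal{K}(\rho_2,\overline{\pi})\bigr]
+ \bigl[\zeta_3(1+\zeta_2)\tfrac{NT}{\beta} - \zeta_3 - \zeta_4\bigr]\mathcal{K}(\nu_3,\overline{\mu})
+ \zeta_4\bigl(\tfrac{NT}{\beta} - 1\bigr)\nu_3\bigl[\mathcal{K}(\rho_4,\overline{\pi})\bigr] + \zeta_5\bigl(\tfrac{NT}{\beta} - 1\bigr)\overline{\mu}\bigl[\mathcal{K}(\rho_5,\overline{\pi})\bigr]$$
$$\le \gamma(\nu\rho - \overline{\mu}\,\overline{\pi})(r) + \zeta_1\gamma\overline{\mu}(\rho_1 - \overline{\pi})(r) + \zeta_2\gamma\nu(\rho_2 - \overline{\pi})(r)
+ \zeta_3\gamma(\nu_3 - \overline{\mu})\overline{\pi}(r) + \zeta_4\gamma\nu_3(\rho_4 - \overline{\pi})(r) + \zeta_5\gamma\overline{\mu}(\rho_5 - \overline{\pi})(r)$$
$$+ C\bigl[(\nu\rho)\otimes(\overline{\mu}\,\overline{\pi}) + \zeta_1\overline{\mu}(\rho_1 \otimes \overline{\pi}) + \zeta_2\nu(\rho_2 \otimes \overline{\pi}) + \zeta_3(\nu_3\overline{\pi})\otimes(\nu_3\rho_1) + \zeta_3(\nu_3\rho_1)\otimes(\overline{\mu}\,\overline{\pi})
+ \zeta_4\nu_3(\rho_4 \otimes \overline{\pi}) + \zeta_5\overline{\mu}(\rho_5 \otimes \overline{\pi})\bigr](m')$$
$$+ (1+\zeta_2)\bigl[\mathcal{K}(\nu,\mu) - \mathcal{K}(\overline{\mu},\mu)\bigr] + \nu\bigl[\mathcal{K}(\rho,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr]
+ \zeta_1\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_1,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr] + \zeta_2\tfrac{NT}{\beta}\nu\bigl[\mathcal{K}(\rho_2,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr]$$
$$+ \zeta_3(1+\zeta_2)\tfrac{NT}{\beta}\bigl[\mathcal{K}(\nu_3,\mu) - \mathcal{K}(\overline{\mu},\mu)\bigr] + \zeta_4\tfrac{NT}{\beta}\nu_3\bigl[\mathcal{K}(\rho_4,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr]
+ \zeta_5\tfrac{NT}{\beta}\overline{\mu}\bigl[\mathcal{K}(\rho_5,\pi) - \mathcal{K}(\overline{\pi},\pi)\bigr] + (1 + \zeta_1 + \zeta_2 + \zeta_3 + \zeta_4 + \zeta_5)\log(\tfrac{2}{\epsilon}).$$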

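Let us now apply to $\overline{\pi}$ (we shall later do the same with
$\overline{\mu}$) the following inequalities, holding for any random functions
of the sample and the parameters $h : \Omega \times \Theta \to \mathbb{R}$ and
$g : \Omega \times \Theta \to \mathbb{R}$,

$$\overline{\pi}(g - h) - \mathcal{K}(\overline{\pi},\pi) \le \sup_{\rho}\ \rho(g - h) - \mathcal{K}(\rho,\pi)$$
$$= \log\{\pi[\exp(g - h)]\}$$
$$= \log\{\pi[\exp(-h)]\} + \log\{\pi_{\exp(-h)}[\exp(g)]\}$$
$$= -\pi_{\exp(-h)}(h) - \mathcal{K}(\pi_{\exp(-h)},\pi) + \log\{\pi_{\exp(-h)}[\exp(g)]\}.$$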
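The chain of identities above can be checked numerically on a finite parameter set. The following is a minimal sketch, assuming a three-point parameter space with an arbitrary prior and arbitrary functions `h` and `g`; all numerical values and helper names are illustrative assumptions, not taken from the text.

```python
import math

def gibbs(prior, h):
    """Gibbs measure pi_{exp(-h)}: density proportional to exp(-h) w.r.t. the prior."""
    weights = [p * math.exp(-hi) for p, hi in zip(prior, h)]
    total = sum(weights)
    return [w / total for w in weights]

def kl(rho, prior):
    """Relative entropy K(rho, prior) on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)

def expect(measure, f):
    """Integral of f with respect to a finite measure."""
    return sum(m * fi for m, fi in zip(measure, f))

# Hypothetical toy data (assumptions made for illustration only).
prior = [0.5, 0.3, 0.2]
h = [0.2, 1.0, 0.4]
g = [0.1, -0.3, 0.7]

# log pi[exp(g - h)]
log_laplace = math.log(expect(prior, [math.exp(gi - hi) for gi, hi in zip(g, h)]))

# -pi_{exp(-h)}(h) - K(pi_{exp(-h)}, pi) + log pi_{exp(-h)}[exp(g)]
pi_h = gibbs(prior, h)
decomposition = (-expect(pi_h, h) - kl(pi_h, prior)
                 + math.log(expect(pi_h, [math.exp(gi) for gi in g])))

# Any other posterior rho gives a smaller value of rho(g - h) - K(rho, pi).
rho = [0.2, 0.5, 0.3]
lower_value = expect(rho, [gi - hi for gi, hi in zip(g, h)]) - kl(rho, prior)
```

The two quantities `log_laplace` and `decomposition` agree up to rounding, and `lower_value` stays below them, which is the variational inequality used in the first line of the display.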

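When $h$ and $g$ are observable, and $h$ is not too far from
$\beta r \simeq \beta R$, this gives a way to replace $\overline{\pi}$ with some
satisfactory empirical approximation. We will apply this method, choosing
$\rho_1$ and $\rho_5$ such that $\overline{\mu}\,\overline{\pi}$ is replaced
either with $\overline{\mu}\rho_1$, when it comes from the first two
inequalities, or with $\overline{\mu}\rho_5$ otherwise, choosing $\rho_2$ such
that $\nu\overline{\pi}$ is replaced with $\nu\rho_2$ and $\rho_4$ such that
$\nu_3\overline{\pi}$ is replaced with $\nu_3\rho_4$. We will do so because it
leads to a lot of helpful cancellations. For those to happen, we need to choose
$\rho_i = \pi_{\exp(-\lambda_i r)}$, $i = 1, 2, 4$, where $\lambda_1$,
$\lambda_2$ and $\lambda_4$ are such that

$$(1 + \zeta_1)\gamma = \zeta_1\tfrac{NT}{\beta}\lambda_1,$$
$$\zeta_2\gamma = \bigl(1 + \zeta_2\tfrac{NT}{\beta}\bigr)\lambda_2,$$
$$(\zeta_4 - \zeta_3)\gamma = \zeta_4\tfrac{NT}{\beta}\lambda_4,$$
$$\zeta_3\gamma = \zeta_5\tfrac{NT}{\beta}\lambda_5,$$

and to assume that $\zeta_4 > \zeta_3$. We obtain that with $\mathbb{P}$
probability at least $1 - \epsilon$,
$$-N\log\bigl[1 - T(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta(\nu\rho - \overline{\mu}\,\overline{\pi})(R)$$
$$\le \gamma(\nu\rho - \overline{\mu}\rho_1)(r) + \zeta_3\gamma(\nu_3\rho_4 - \overline{\mu}\rho_5)(r)$$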



+ Ci^7I log 



Pi i exp 



+ (l + C2^)^|log|p2|exp 



c 

1 ,^ ATT 
I+C2-3- 



P4< exp 



/35< exp 



^IvTc; [C3Z^3/5i + C5P5] {m') 



+ (1 + C2) [3C(i., m) - 3C(m, //)] + u [%{p, vr) - X(p2, tt)] 
+ C3(l + C2)f^[3<:(^^3,M)-3C(M,M)] 



i=i ^ 



In order to obtain more cancellations while replacing $\overline{\mu}$ by some
posterior distribution, we will choose the constants such that
$\lambda_5 = \lambda_4$, which can be done by choosing

$$\zeta_5 = \frac{\zeta_3\zeta_4}{\zeta_4 - \zeta_3}.$$

We can now replace $\overline{\mu}$ with
$\mu_{\exp(-\xi_1\rho_1(r) - \xi_4\rho_4(r))}$, where

7 



(l + C2)(l + ^C3)' 
(1 + C2)(1WC3)- 



Choosing moreover $\nu_3 = \mu_{\exp(-\xi_1\rho_1(r) - \xi_4\rho_4(r))}$ to
induce some more cancellations, we get

Theorem 1.59. For any positive real constants satisfying the above mentioned
constraints, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any
posterior distribution $\nu : \Omega \to \mathcal{M}_+^1(M)$ and any
conditional posterior distribution
$\rho : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

$$-N\log\bigl[1 - T(\nu\rho - \overline{\mu}\,\overline{\pi})(R)\bigr] - \beta(\nu\rho - \overline{\mu}\,\overline{\pi})(R) \le B(\nu,\rho,\beta),$$

where $B(\nu,\rho,\beta) \overset{\text{def}}{=} \gamma(\nu\rho - \nu_3\rho_1)(r)$

+ (1+C2)(1 + ^C3) 







1- 


pijexp 



X p4 < exp 



/3(l+C2)(l+^C3) 



/3(1+C2)(1+^C3) 



+ (l + C2^)i^|log|p2|exp 



c 



<2P2(m') 



P4< exp 



<^ JVTC4 [C3Z^3Pi + Upi] (m') 




+ {l + C2)[0C{u,fi)-X{i^s,li)] 

+ V {%{p, Tt) - %{P2. Tt)] + (1 + 0) l0g(f ) . 

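This theorem can be used to find the largest value $\widehat{\beta}(\nu\rho)$
of $\beta$ such that $B(\nu,\rho,\beta) \le 0$, thus providing an estimator for
$\beta(\nu\rho)$ defined as
$\nu\rho(R) = \overline{\mu}_{\beta(\nu\rho)}\overline{\pi}_{\beta(\nu\rho)}(R)$,
where we have mentioned explicitly the dependence of $\overline{\mu}$ and
$\overline{\pi}$ on $\beta$, the constant $\zeta_2$ staying fixed. The posterior
distribution $\nu\rho$ may then be chosen to maximize
$\widehat{\beta}(\nu\rho)$ within some manageable subset of posterior
distributions $\mathcal{P}$, thus gaining the assurance that
$\nu\rho(R) \le \overline{\mu}_{\widehat{\beta}(\nu\rho)}\overline{\pi}_{\widehat{\beta}(\nu\rho)}(R)$,
with the largest parameter $\widehat{\beta}(\nu\rho)$ that this approach can
provide. Maximizing $\widehat{\beta}(\nu\rho)$ is supported by the fact that
$\lim_{\beta \to +\infty} \overline{\mu}_{\beta}\overline{\pi}_{\beta}(R) = \operatorname{ess\,inf}_{\mu\pi} R$.
Anyhow, there is no assurance (to our knowledge) that
$\beta \mapsto \overline{\mu}_{\beta}\overline{\pi}_{\beta}(R)$ will be a
decreasing function of $\beta$ all the way, although this may be expected to be
the case in many practical situations.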

We can make the bound more explicit in several ways. One point of view
is to put forward the optimal values of $\rho$ and $\nu$. We can thus remark that

[7p(r) + X{p, tt) - X{p2, tt)] + (1 + C2)X{iy, /x) 

= U %[p, 7rexp(-7r)] + A2P2(r-) + / Tre^p{-ar){r)da + (1 + C2)0C{u, p) 

= -{3<:[p,^exp(-,.)]} + ^1 + ^^)^[^'^exp(-^-^/;^ 

•Y^P2(r)-3^/\exp(-..)(r)cia} 



-(l + C2)log<^/X 



exp< - 
Thus



B{u, p, /?) = (! + C2) [eiJ^3Pi(r) + ^lyspdr) 

+ log{/i[cxp(-^ipi(r) - ^4P4(r))] } 



-(l + C2)log<^/i 



exp 



A2 , , 1 
-p2{r) - 



7^3Pi(r-) + (l + C2)(l + ^C3) 



a 









Pi 1 exp 



1 /3(1+C2)(1+^C3) 



X /94< exp 



<5NT 

/3(1+C2)(1+=^C3) 



+ (l + C2^)i^|log|p2|exp 



c 

1+C2-3- 



C2P2(m') 



P4< exp 



+ I/{3C[p,7rexp(-7r)]} 



+ (1 + C2)3ChA^,,p(_^_^^. 



^ 1=1 ^ 



This formula is better understood when thinking about the following upper
bound for the first two lines in the expression of $B(\nu,\rho,\beta)$:



(1 + C2) 6z^3Pi(r) + ^4Z^3P4(r-) + log{//[exp(-^ipi(r) - ^4P4(r))] } 

A2 



(l + C2)log</U 



exp 



1 + 



-P2{r) 



< 1^3 A2P2(r) + / TTe^p{-ar){r)da - 7Pi(r) 
J \1 
Another approach to understanding Theorem 1.59 is to put forward
$\rho_0 = \pi_{\exp(-\lambda_0 r)}$, for some positive real constant
$\lambda_0 < \gamma$, noticing that

$$\nu\bigl[\mathcal{K}(\rho_0,\pi) - \mathcal{K}(\rho_2,\pi)\bigr] = \lambda_0\nu(\rho_2 - \rho_0)(r) - \nu\bigl[\mathcal{K}(\rho_2,\rho_0)\bigr].$$

Thus 

B{u,po,(3) < z/3[(7 - Xo){po - Pi){r) + Xo{p2 - Pi){r)] 



+ (1+C2)(1 + ^C3) 









Pi 1 exp 



X p4< exp 



CiNT 
/3(1+C2)(1+^C3) 



C5NT 
/3(1+C2){1+^C3) 



+ (l + C2^)i^|log|p2|exp 



c 



1+C2- 



<2P2{m' 



P4.S exp 




(1 + C2)X 



p 



exp I 



(7-^0)P0('-)+^0P2('') ^ 
I+C2 > 



-i.[ac(p2,Po)] + (i + ^Ci) fog(f) 
^ i=i ^ 



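In the case when we want to select a single model $\widehat{m}(\omega)$, and
therefore to set $\nu = \delta_{\widehat{m}}$, the previous inequality leads
us to take

$$\widehat{m} \in \arg\min_{m \in M}\ (\gamma - \lambda_0)\rho_0(m, r) + \lambda_0\rho_2(m, r).$$

In parametric situations where
$\pi_{\exp(-\lambda r)}(r) \simeq r^{\star}(m) + \frac{d_e(m)}{\lambda}$, we get

$$(\gamma - \lambda_0)\rho_0(m, r) + \lambda_0\rho_2(m, r) \simeq \gamma\Bigl[r^{\star}(m) + d_e(m)\Bigl(\frac{1}{\lambda_0} + \frac{\lambda_0 - \lambda_2}{\gamma\lambda_2}\Bigr)\Bigr],$$
resulting in a linear penalization of the empirical dimension of the models.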
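To make the linear penalization concrete, here is a minimal sketch of the resulting model selection rule. It assumes the parametric approximation in which the Gibbs posterior at inverse temperature `lam` has empirical error close to `r_star + d_e / lam`; the model names, error rates, dimensions and constants below are hypothetical values chosen for illustration only.

```python
def criterion(r_star, d_e, gamma, lam0, lam2):
    """(gamma - lam0) * rho_0(m, r) + lam0 * rho_2(m, r) under the assumed
    approximation rho_i(m, r) ~ r_star + d_e / lam_i."""
    rho0_r = r_star + d_e / lam0
    rho2_r = r_star + d_e / lam2
    return (gamma - lam0) * rho0_r + lam0 * rho2_r

def penalized_form(r_star, d_e, gamma, lam0, lam2):
    """Same quantity written as gamma * [r_star + d_e * penalty slope]."""
    slope = 1.0 / lam0 + (lam0 - lam2) / (gamma * lam2)
    return gamma * (r_star + d_e * slope)

# Hypothetical family of models: name -> (minimal empirical error, empirical dimension).
models = {"small": (0.30, 2), "medium": (0.22, 10), "large": (0.21, 50)}
gamma, lam0, lam2 = 200.0, 100.0, 150.0  # assumed constants with lam0 < gamma

scores = {name: criterion(r, d, gamma, lam0, lam2) for name, (r, d) in models.items()}
selected = min(scores, key=scores.get)
```

With these illustrative numbers the medium model is selected: the large model has a slightly smaller minimal error, but pays a penalty proportional to its dimension.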

1.5.8. Analysis of the two step relative bound. We will not state a formal
result, but will nevertheless give some hints about how to establish one. We
should start from Theorem 1.25, which gives a deterministic variance term.
From Theorem 1.25, after a change of prior distribution, we obtain for any
positive constants $\alpha_1$ and $\alpha_2$, any prior distributions
$\widetilde{\mu}_1$ and $\widetilde{\mu}_2 \in \mathcal{M}_+^1(M)$, for any
prior conditional distributions $\widetilde{\pi}_1$ and
$\widetilde{\pi}_2 : M \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$
probability at least $1 - \eta$, for any posterior distributions
$\nu_1\rho_1$ and $\nu_2\rho_2$,
+ %[{vipi) (g) {V2P2), (m TTl) ® (P2T^2)]

+ log|(/Ii7fi) (2> (/X2 7f2) exp{-a2*f^2(-R',M') + Q;ii?'}j} -log(r/). 
Applying this to ai = 0, we get that 

(i/p- f3Pi)(r) < — (8) (i/3/9i),(/l7r) (8) (/xaTTi)] 

"2 1 

+ log|(/Ii7)® (/Is^i) exp{a2*_£2(i?',M')}]} -log(r?) 
In the same way, to bound quantities of the form 



Pi \ exp 



Ci{i'P + C,iPi){^') 



pi 



X p4 < exp 



C2 [Csi^sPi + C5P4] {m) 

Pi sup] Ci [(l/p) (2> (z^S/Os) + Cl^5(/9l ® Ps)] (jTl') - 3<:(/95,/9i) [ 
P2 SUpj C2 [C3('^3Pl) ® {l^bPo) 

+ Cbi^biPi® P6)]im') -0C{P6,P4)^ -X(iy5,i^3) >, 



where $C_1$, $C_2$, $p_1$ and $p_2$ are positive constants, and similar terms,
we need to use inequalities of the type: for any prior distributions
$\widetilde{\mu}_i\widetilde{\pi}_i$, $i = 1, 2$, with $\mathbb{P}$
probability at least $1 - \eta$, for any posterior distributions
$\nu_i\rho_i$, $i = 1, 2$,

cxsi.i'ipi) ® {v2P2){rn') < log|(|Lti7fi) {p,2Tr2)exp 0:3$^ (M')] | 

+ X[{uipi) (g) {V2P2), (w TTi) «) (/U2 7r2)] - log(?7)- 

We also need the variant: with $\mathbb{P}$ probability at least $1 - \eta$,
for any posterior distribution $\nu_1 : \Omega \to \mathcal{M}_+^1(M)$ and any
conditional posterior distributions
$\rho_1, \rho_2 : \Omega \times M \to \mathcal{M}_+^1(\Theta)$,

a3i^i(pi ® P2){m') < log|/Ii(7ri (g) 772) exp Q;3$_a^(M') | 

+ 0<;(l^l,/Xl) + I^l{0<;[pi (8)p2,7fl (8)7f2]} -log(77). 
We deduce that



log< 1^3 



Pi <^ exp 



pi 



X p4 < exp 



C2 [C3i^3Pi + C5P4] (m') 



P2 







Pi sup 


-{ 


[ P5 


as [ 



^{P5,Pl) 



— i|log|(//7r) (g) (//5 7r5)exp Q;3$_^(M')j| 

+ X[{l^p) (8) (Z^SPS), (/^ TT {JI5 TT5)] + log(|) 

+ Ci log|/l5(7fi (gi 7f5) exp a3(I>_^(M') I 
+ X{u5,Ji5) + U5{X[pi (8>P5,7ri (8)7r5]} +log(|) 

^|log|(/l3 7fi) (g) (/I5 7f6)exp a3$_^(M')j| 

+ 3C[(z/3/9i) (g) (I'spe), (Ai3 TTi (8) (//5 TTe)] + log(|) 
log|/l5(7r4 (g) TTe) exp Q;3(^_2a(M') I 

+ X{U5, + f5{3C [p4 <8) P6, 7f4 (g) TTe] } + log(|) 



+ P2 sup 

P6 



- 3C(p6,P4) 



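We are then left with the need to bound entropy terms like
$\mathcal{K}(\nu_3\rho_1, \widetilde{\mu}_3\widetilde{\pi}_1)$, where we have
the choice of $\widetilde{\mu}_3$ and $\widetilde{\pi}_1$, to obtain a useful
bound. As could be expected, we decompose it into

$$\mathcal{K}(\nu_3\rho_1, \widetilde{\mu}_3\widetilde{\pi}_1) = \mathcal{K}(\nu_3, \widetilde{\mu}_3) + \nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr].$$

Let us deal with the second term first, choosing
$\widetilde{\pi}_1 = \pi_{\exp(-\beta_1 R)}$: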
us [X{pi,ni)] = 1^3 [Pi{pi - ni){R) + 3C(pi, tt) - tt)] 

< ^ |a2i^3(pi - 7ri)W +^{1^3,1^3) + t'3[3<^(Pl,7fl)] 

+ log{Ai3(^f' ) [exp{-a2*^(i?', M') + aiE'}] } - log(77) 
+ z/3[3^(pi,7r)-X(^i,7r)]
+ log{/l3(^D [exp{ -02*21 (i?', M') + aii?'}] } - log(??) 



Thus, when the constraint Ai = is satisfied, 

Z^3[3C(pi,^fi)] < (l-^)"'^[x(i.3,/l3) 

+ log{/l3(^D [exp{-Q2*£i (i?', M') + aii?'}] } - log{r]) 

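We can further specialize the constants, choosing
$\alpha_1 = N\sinh(\tfrac{\alpha_2}{N})$, so that

$$-\alpha_2\Psi_{\frac{\alpha_2}{N}}(R', M') + \alpha_1 R' \le 2N\sinh\bigl(\tfrac{\alpha_2}{2N}\bigr)^2 M'.$$

We can for instance choose $\alpha_2 = \gamma$,
$\alpha_1 = N\sinh(\tfrac{\gamma}{N})$, and
$\beta_1 = \lambda_1\tfrac{N}{\gamma}\sinh(\tfrac{\gamma}{N})$, leading to

Proposition 1.60. With the notations of Theorem 1.59, the constants being set
as explained above, putting
$\widetilde{\pi}_1 = \pi_{\exp(-\lambda_1\frac{N}{\gamma}\sinh(\frac{\gamma}{N})R)}$,
with $\mathbb{P}$ probability at least $1 - \eta$,

$$\nu_3\bigl[\mathcal{K}(\rho_1, \widetilde{\pi}_1)\bigr] \le \Bigl(1 - \frac{\lambda_1}{\gamma}\Bigr)^{-1}\frac{\lambda_1}{\gamma}\Bigl[ \mathcal{K}(\nu_3, \widetilde{\mu}_3) + \log\bigl\{\widetilde{\mu}_3\bigl(\widetilde{\pi}_1^{\otimes 2}\bigr)\bigl[\exp\{2N\sinh(\tfrac{\gamma}{2N})^2 M'\}\bigr]\bigr\} - \log(\eta) \Bigr].$$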



More generally 



I/3[X(p,^l)] < (1 



Ai\-iAi 



7 ^ 7 



3C('^3,At3) 



+ log{/l3(^f ) [exp{2Arsinh(2^)2M'}] } - log(r?) 
In a similar way, let us now choose
$\widetilde{\mu}_3 = \mu_{\exp[-\alpha_3\overline{\pi}(R)]}$. We can write



/ as 



ai 



+ 



log|(/l37f) (At37r) exp{-a2^'^(i?', M') + aiR'] | - log(?7) 

+ 3C(i/,/i) -3C(/l3,/i). 



Let us choose $\alpha_2 = \gamma$, $\alpha_1 = N\sinh(\tfrac{\gamma}{N})$, and
let us add some other entropy inequalities to get rid of $\overline{\pi}$ in a
suitable way, the approach of entropy compensation being quite the same as the
one used to obtain the empirical bound of Theorem 1.59. This results with
$\mathbb{P}$ probability at least $1 - \eta$ in



< 



03 

ai 



7(1/ - /i3)7r(r) 



+ log{(/i3vf) «) (ji^tt) [exp{-7^'^ {R', M') + aiR']\ } + log(|] 



C6(] 



—)fi3[X{pQ,7r)] < Ce— 7^3(/06 - vr)(r) 

+ log{/l3(vf®2) [exp{-7^'^(i?', M') + aiR'}] } + log(^ 

+ Ce/^s [3<^(P6, tt) - aC(7f, tt)] , 

7/l3(/97 -7f)(r) 
exp{-7*^ (i?', M') + aiR'}] } + log(|) 



C7(l-^)/l3[3^(p7,vf)] <C7- 



+ log<^/i3(vr 



C8(l-^)43^(P8,vf)] <C8- 



(9(1 



7iv(/)8 - vr)(r) +D<:(z^,/i3) 
+ log{/l3(vf®2) [exp{-7^'^(i?', M') + aii?'}] } + log(|) 

+ C84X{ps,Tt)-X{7T,7t) 
J^)u[X{pg,7r)] <C9— Mp9-Tr){r)+X{u,Jl3) 
+ log{/l3(7f®2) [exp{-7*^(i?', M') + aiR'}] } + log(|^

+ C9Z^[aC(p9,vr)-3C(7f,7r)], 

where we have introduced a bunch of constants, assumed to be positive, that 
we will more precisely set to 



X8 + Xg 


= 1, 


ai 


= Ae, 


T 
ai 


= A7, 


T 
ai 


= As, 


{Cg/3 - Xgas) — 

ai 


= A9. 



We get with $\mathbb{P}$ probability at least $1 - \eta$,

(i-^-(C8 + C9)-)3<:(^,/i3)< 

\ (Xi Oil / 



+ 



— 7 ['^{xsps + xgpg){r) - psixsPe + Xgp7)(r)] 
+ ^log|(/i37f) (8) (/l37f)[exp{-7*^(i?',M') + aii?'} | 
(Ce + Ct + Cs + C9)£- log{/l3(7f®') [ew{-7^^{R', M') + aii?'}] } 
+ fi) - Xiils, P') + {^ + (Ce + Ct + Cs + C9)|-) log(|) . 



Let us choose the constants so that Ai = Ay = Ag, A4 = Ag = As, a^xg-^ = ^1 
and as^s^ = ■^4- This is done by setting 



X8 = 


^4 


6+^4' 


Xg = 


6 


6 + ^4' 


as = 


f sinh(^)(ei + e4) 


C6 = 


^sinh(^)(^^-^^: 


C7 = 


fsinh(^)i^ 
Cs = ^Sinh(;i)



C9 



fsinh(^)i^. 



The inequality $\lambda_1 > \xi_1$ is always satisfied. The inequality
$\lambda_4 > \xi_4$ is required for the above choice of constants, and will be
satisfied for a suitable choice of $\zeta_3$ and $\zeta_4$.

Under these assumptions, we obtain with $\mathbb{P}$ probability at least
$1 - \eta$



(1 - ^ - (Cs + C9)-)3C(i/, <{i^- Jismpi + UP4){r) 

+ ^ log|(/i37f) ^ (psTf) exp{-j^j_{R',M') + aiR'}j } 

+ (Ce + Ct + Cs + Cg)^ log{/l3(vf^') [exp{-7*^ {R', M') + a^R!]] } 

+ Xiu, fi) - XiJi^, + + (Ce + Ct + Cs + C9) -) log (|) . 

\ai a.\/ ' 

This proves 



Proposition 1.61. The constants being set as explained above, with
$\mathbb{P}$ probability at least $1 - \eta$, for any posterior distribution
$\nu : \Omega \to \mathcal{M}_+^1(M)$,



3C(^,/i3)< (l-^-(C8+C9)-^"' 



+ 



X{u, us) 

^ log{(/l37f) (/l37f) [exp{-7^'^(ii;', M') + aiR'}^ } 
+ (Ce + Ct + Cs + Cq) — logics (vf®2) [cxp{-7^^(i?', M') + aii?'}] | 

+ (- + (C6 + C7 + C8 + C9)^)l0g(|) 

Thus 



3C(l^3/9l,//3 7ri) < 



l-^-(C8 + C9)^ 

logj (/l37f (8) (Ai37f) [exp{-7^i(i?',M') + aii?'}] } 
+ (Ce + Ct + Cs + C9)^ log{/i3(7f^') [exp{-7*^(i?', M') + aii?'}] } 



03 
+



(^ + (C6+C7 + C8+C9)^) log(|) 

i [log{/l3(^f ) [exp{2Arsinh(2^)=^M'}] } - log(|) 



We will not go further, lest it become tedious, but we hope we have given
sufficient hints to state informally that the bound $B(\nu,\rho,\beta)$ of
Theorem 1.59 is upper bounded with $\mathbb{P}$ probability close to one by a
bound of the same flavour where the empirical quantities $r$ and $m'$ have been
replaced with their expectations $R$ and $M'$.



2.1. Basic inequalities. In this section the observed sample
$(X_i, Y_i)_{i=1}^{N}$ will be supplemented with a shadow sample
$(X_i, Y_i)_{i=N+1}^{(k+1)N}$. This point of view, called transductive
classification, was introduced by V. Vapnik. It may be justified in different
ways.

On the practical side, one interest of the transductive setting is that it is
often a lot easier to collect examples than it is to label them, so that it is
not unrealistic to assume that we indeed have two training samples, one
labelled and one unlabelled. It also covers the case when a batch of patterns
is to be classified and we are allowed to observe the whole batch before
issuing the classification.

On the mathematical side, considering a shadow sample proves technically
fruitful. Indeed, when introducing the VC entropy and VC dimension concepts, as
well as when dealing with compression schemes, although the inductive setting
is our final concern, the transductive setting is a useful detour. In this
second scenario, intermediate technical results involving the shadow sample
are integrated with respect to unobserved random variables in a second stage
of the proofs.

Let us now describe the changes to be made to previous notations to adapt
them to the transductive setting. The distribution $\mathbb{P}$ will be a
probability measure on the canonical space
$\Omega = (\mathcal{X} \times \mathcal{Y})^{(k+1)N}$, and
$(X_i, Y_i)_{i=1}^{(k+1)N}$ will be the canonical process on this space (that
is the coordinate process). Unless explicitly mentioned, the parameter $k$
indicating the size of the shadow sample will remain fixed. Assuming the shadow
sample size is a multiple of the training sample size is convenient without
significantly restricting the generality. For a while, we will use a weaker
assumption than independence, assuming that $\mathbb{P}$ is partially
exchangeable, since this is all we need in the proofs.

Definition 2.1. For $i = 1, \dots, N$, let $\tau_i : \Omega \to \Omega$ be
defined for any $\omega = (\omega_j)_{j=1}^{(k+1)N} \in \Omega$ by

$$\tau_i(\omega)_{i+jN} = \omega_{i+(j-1)N}, \quad j = 1, \dots, k,$$
$$\tau_i(\omega)_{i} = \omega_{i+kN},$$
$$\text{and}\quad \tau_i(\omega)_{m+jN} = \omega_{m+jN}, \quad m \ne i,\ m = 1, \dots, N,\ j = 0, \dots, k.$$

Clearly, if we arrange the $(k+1)N$ samples in an $N \times (k+1)$ array,
$\tau_i$ performs a circular permutation of the $k+1$ entries of the $i$th row,
leaving the other rows unchanged. Moreover, all the circular permutations of
the $i$th row have the form $\tau_i^j$, $j$ ranging from $0$ to $k$.
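The action of $\tau_i$ can be sketched on a plain list. The function below is an illustrative assumption about how one might encode the definition (1-based row index `i`, sample stored as a flat list of length `(k+1)*N`); it is not part of the original text.

```python
def tau(i, omega, N, k):
    """Apply the circular shift tau_i to the i-th row (1-based) of a sample
    of length (k+1)*N stored as a flat list, leaving the other rows unchanged."""
    out = list(omega)
    # tau_i(omega)_{i+jN} = omega_{i+(j-1)N} for j = 1..k (1-based positions).
    for j in range(1, k + 1):
        out[i - 1 + j * N] = omega[i - 1 + (j - 1) * N]
    # The first entry of the row receives the last one, closing the cycle.
    out[i - 1] = omega[i - 1 + k * N]
    return out

# Toy sample with N = 3 rows and k + 1 = 3 columns, entries labelled 0..8.
N, k = 3, 2
omega = list(range((k + 1) * N))
shifted = tau(1, omega, N, k)

# Iterating tau_1 exactly k + 1 times gives back the original sample.
cycled = omega
for _ in range(k + 1):
    cycled = tau(1, cycled, N, k)
```

Here `shifted` only differs from `omega` at positions 1, 4 and 7 (1-based), which form the first row of the $N \times (k+1)$ array.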

The probability distribution $\mathbb{P}$ is said to be partially exchangeable
if for any $i = 1, \dots, N$, $\mathbb{P} \circ \tau_i^{-1} = \mathbb{P}$.

This means equivalently that for any bounded measurable function
$h : \Omega \to \mathbb{R}$, $\mathbb{P}(h \circ \tau_i) = \mathbb{P}(h)$.

In the same way a function $h$ defined on $\Omega$ will be said to be partially
exchangeable if $h \circ \tau_i = h$ for any $i = 1, \dots, N$. Accordingly a
posterior distribution $\rho : \Omega \to \mathcal{M}_+^1(\Theta, \mathcal{T})$
will be said to be partially exchangeable when
$\rho(\omega, A) = \rho[\tau_i(\omega), A]$, for any $\omega \in \Omega$, any
$i = 1, \dots, N$ and any $A \in \mathcal{T}$.

For any bounded measurable function $h$, let us define
$T_i(h) = \frac{1}{k+1}\sum_{j=0}^{k} h \circ \tau_i^j$. Let
$T(h) = T_N \circ \cdots \circ T_1(h)$. For any partially exchangeable
probability distribution $\mathbb{P}$, and for any bounded measurable function
$h$, $\mathbb{P}[T(h)] = \mathbb{P}(h)$. Let us put

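$\sigma_i(\theta) = \mathbb{1}[f_\theta(X_i) \ne Y_i]$, indicating the success
or failure of $f_\theta$ to predict $Y_i$ from $X_i$,

$r_1(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sigma_i(\theta)$, the empirical error
rate of $f_\theta$ on the observed sample,

$r_2(\theta) = \frac{1}{kN}\sum_{i=N+1}^{(k+1)N}\sigma_i(\theta)$, the error
rate of $f_\theta$ on the shadow sample,

$\overline{r}(\theta) = \frac{r_1(\theta) + k r_2(\theta)}{k+1} = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\sigma_i(\theta)$,
the global error rate of $f_\theta$,

$R_i(\theta) = \mathbb{P}[f_\theta(X_i) \ne Y_i]$, the expected error rate of
$f_\theta$ on the $i$th input,

$R(\theta) = \frac{1}{N}\sum_{i=1}^{N}R_i(\theta) = \mathbb{P}[r_1(\theta)] = \mathbb{P}[r_2(\theta)]$,
the average expected error rate of $f_\theta$ on all inputs.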

We will allow for posterior distributions
$\rho : \Omega \to \mathcal{M}_+^1(\Theta)$ depending on the shadow sample. The
most interesting ones will anyhow be independent of the shadow labels
$Y_{N+1}, \dots, Y_{(k+1)N}$. We will be interested in the conditional expected
error rate of the randomized classification rule described by $\rho$ on the
shadow sample, given the observed sample, which reads as
$\mathbb{P}[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}]$.

Let us comment on the case when $\mathbb{P}$ is invariant by any permutation
of any row, meaning that

$$\mathbb{P}[h(\omega \circ s)] = \mathbb{P}[h(\omega)] \quad\text{for all } s \in \mathfrak{S}(\{i + jN;\ j = 0, \dots, k\})$$
and all $i = 1, \dots, N$ (where $\mathfrak{S}(A)$ is the set of permutations
of $A$, extended to $\{1, \dots, (k+1)N\}$ so as to be the identity outside of
$A$). In this case, if $\rho$ is invariant by permutations of the rows of the
shadow sample, meaning that
$\rho(\omega \circ s) = \rho(\omega) \in \mathcal{M}_+^1(\Theta)$,
$s \in \mathfrak{S}(\{i + jN;\ j = 1, \dots, k\})$, $i = 1, \dots, N$, then
$\mathbb{P}[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{P}[\rho(\sigma_{i+N}) \mid (X_i, Y_i)_{i=1}^{N}]$,
meaning that the expectation can be taken on a restricted shadow sample of the
same size as the observed sample. If moreover the rows are equidistributed
(meaning that their marginal distributions are equal), then

$$\mathbb{P}[\rho(r_2) \mid (X_i, Y_i)_{i=1}^{N}] = \mathbb{P}[\rho(\sigma_{N+1}) \mid (X_i, Y_i)_{i=1}^{N}].$$
This means that under these quite commonly fulfilled assumptions, the
expectation can be taken on a single new object to be classified. Our study
thus covers the case when only one of the patterns from the shadow sample is
to be labelled and one is interested in the expected error rate of this single
labelling. Of course, in the case when $\mathbb{P}$ is i.i.d. and $\rho$
depends only on the training sample $(X_i, Y_i)_{i=1}^{N}$, we fall back on the
usual criterion of performance
$\mathbb{P}[\rho(r_2) \mid (Z_i)_{i=1}^{N}] = \rho(R) = \rho(R_1)$.

Let us recall the notation
$\Phi_a(p) = -a^{-1}\log\{1 - p[1 - \exp(-a)]\}$.
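A direct transcription of this function, together with its inverse in $p$, may help fix ideas. This is a minimal sketch; the function names `phi` and `phi_inv` are illustrative choices, and the inverse is obtained by solving $\Phi_a(p) = q$ for $p$.

```python
import math

def phi(a, p):
    """Phi_a(p) = -(1/a) * log(1 - p * (1 - exp(-a))), for a > 0 and 0 <= p <= 1."""
    return -math.log(1.0 - p * (1.0 - math.exp(-a))) / a

def phi_inv(a, q):
    """Inverse of p -> Phi_a(p): the value p such that Phi_a(p) = q."""
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

a = 0.968  # an arbitrary positive value, chosen for illustration
values = [phi(a, p) for p in (0.0, 0.2, 0.5, 1.0)]
```

Since $\exp(-ap) \le 1 - p[1 - \exp(-a)]$ by convexity of the exponential, $\Phi_a(p) \le p$ on $[0, 1]$, with equality at the endpoints; the inverse is the form in which the bounds of the next section are solved for the error rate.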

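Using an obvious factorization, and considering for the moment a fixed
value of $\theta$ and any partially exchangeable positive real measurable
function $\lambda : \Omega \to \mathbb{R}_+$, we can compute the log-Laplace
transform of $r_1$ under $T$, which acts like a conditional probability
distribution:

$$\log\{T[\exp(-\lambda r_1)]\} = \sum_{i=1}^{N}\log\bigl\{T_i\bigl[\exp(-\tfrac{\lambda}{N}\sigma_i)\bigr]\bigr\} = -\frac{\lambda}{N}\sum_{i=1}^{N}\Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)] \le -\lambda\Phi_{\frac{\lambda}{N}}(\overline{r}),$$
the last inequality coming from the convexity of $\Phi_{\frac{\lambda}{N}}$.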



Olivier Catoni 



May 28, 2006 






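Remarking that
$T\bigl\{\exp\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr]\bigr\} = \exp[\lambda\Phi_{\frac{\lambda}{N}}(\overline{r})]\,T[\exp(-\lambda r_1)]$,
we obtain

Lemma 2.1. For any $\theta \in \Theta$ and any partially exchangeable positive
real measurable function $\lambda : \Omega \to \mathbb{R}_+$,

$$T\bigl\{\exp\bigl[\lambda\{\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] - r_1(\theta)\}\bigr]\bigr\} \le 1.$$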
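Lemma 2.1 can be checked numerically for a fixed classification rule. The sketch below assumes the error indicators are stored as an array with $N$ rows and $k+1$ columns (column 0 being the observed sample), and uses the fact that $T$ averages independently over the circular shifts of each row; the random data and the helper names are illustrative assumptions.

```python
import math
import random

def phi(a, p):
    """Phi_a(p) = -(1/a) * log(1 - p * (1 - exp(-a)))."""
    return -math.log(1.0 - p * (1.0 - math.exp(-a))) / a

def lemma_lhs(sigma, lam):
    """Compute T{exp[lam * (Phi_{lam/N}(r_bar) - r_1)]} for 0/1 error
    indicators sigma[i][j], row i, column j (column 0 = observed sample)."""
    n_rows = len(sigma)
    n_cols = len(sigma[0])
    r_bar = sum(sum(row) for row in sigma) / (n_rows * n_cols)
    # T[exp(-lam * r_1)] factorizes into a product over rows, each factor being
    # the average of exp(-lam/N * sigma) over the k+1 entries of the row.
    log_t = sum(
        math.log(sum(math.exp(-lam / n_rows * s) for s in row) / n_cols)
        for row in sigma
    )
    return math.exp(lam * phi(lam / n_rows, r_bar) + log_t)

random.seed(0)
N, k = 50, 3
sigma = [[random.randint(0, 1) for _ in range(k + 1)] for _ in range(N)]
results = [lemma_lhs(sigma, lam) for lam in (1.0, 10.0, 50.0, 200.0)]
```

All entries of `results` are at most one, as the lemma asserts; equality would require every row to have the same average error.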

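We deduce from this lemma a result analogous to the inductive case:

Theorem 2.2. For any partially exchangeable positive real measurable function
$\lambda : \Omega \times \Theta \to \mathbb{R}_+$, for any partially
exchangeable posterior distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr] - \mathcal{K}(\rho,\pi)\Bigr]\Bigr\} \le 1.$$

The proof is deduced from the previous lemma, using the fact that $\pi$ is
partially exchangeable:

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \rho\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr] - \mathcal{K}(\rho,\pi)\Bigr]\Bigr\}$$
$$= \mathbb{P}\bigl\{\pi\bigl[\exp\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr]\bigr]\bigr\} = \mathbb{P}\bigl\{T\pi\bigl[\exp\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr]\bigr]\bigr\}$$
$$= \mathbb{P}\bigl\{\pi\bigl[T\exp\bigl[\lambda[\Phi_{\frac{\lambda}{N}}(\overline{r}) - r_1]\bigr]\bigr]\bigr\} \le 1.$$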



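Introducing in the same way

$$m'(\theta, \theta') = \frac{1}{N}\sum_{i=1}^{N}\bigl|\mathbb{1}[f_\theta(X_i) \ne Y_i] - \mathbb{1}[f_{\theta'}(X_i) \ne Y_i]\bigr|$$
$$\text{and}\quad \overline{m}'(\theta, \theta') = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\bigl|\mathbb{1}[f_\theta(X_i) \ne Y_i] - \mathbb{1}[f_{\theta'}(X_i) \ne Y_i]\bigr|,$$

we could prove along the same line of reasoning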
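Theorem 2.3. For any real parameter $\lambda$, any
$\widetilde{\theta} \in \Theta$, any partially exchangeable posterior
distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \lambda\Bigl[\rho\bigl\{\Psi_{\frac{\lambda}{N}}\bigl[\overline{r}(\cdot) - \overline{r}(\widetilde{\theta}), \overline{m}'(\cdot, \widetilde{\theta})\bigr]\bigr\} - \bigl[\rho(r_1) - r_1(\widetilde{\theta})\bigr]\Bigr] - \mathcal{K}(\rho,\pi)\Bigr]\Bigr\} \le 1.$$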



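Theorem 2.4. For any real constant $\gamma$, for any
$\widetilde{\theta} \in \Theta$, for any partially exchangeable posterior
distribution $\pi : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal{M}_+^1(\Theta)} \Bigl( -N\rho\bigl\{\log\bigl[1 - \tanh(\tfrac{\gamma}{N})[\overline{r}(\cdot) - \overline{r}(\widetilde{\theta})]\bigr]\bigr\} - \gamma\bigl[\rho(r_1) - r_1(\widetilde{\theta})\bigr] - N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr]\rho\bigl[m'(\cdot, \widetilde{\theta})\bigr] - \mathcal{K}(\rho,\pi) \Bigr)\Bigr]\Bigr\} \le 1.$$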



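This last theorem can be generalized to give

Theorem 2.5. For any real constant $\gamma$, for any partially exchangeable
posterior distributions $\pi^1, \pi^2 : \Omega \to \mathcal{M}_+^1(\Theta)$,

$$\mathbb{P}\Bigl\{\exp\Bigl[\sup_{\rho_1, \rho_2 \in \mathcal{M}_+^1(\Theta)} \Bigl( -N\log\bigl\{1 - \tanh(\tfrac{\gamma}{N})[\rho_1(\overline{r}) - \rho_2(\overline{r})]\bigr\} - \gamma\bigl[\rho_1(r_1) - \rho_2(r_1)\bigr] - N\log\bigl[\cosh(\tfrac{\gamma}{N})\bigr]\rho_1 \otimes \rho_2(m') - \mathcal{K}(\rho_1, \pi^1) - \mathcal{K}(\rho_2, \pi^2) \Bigr)\Bigr]\Bigr\} \le 1.$$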

To conclude this section, we see that the basic theorems of transductive
PAC-Bayesian classification have exactly the same form as the basic
inequalities of inductive classification, Theorems 1.4, 1.25 and 1.26, with
$R(\theta)$ replaced with $\overline{r}(\theta)$, $r(\theta)$ replaced with
$r_1(\theta)$ and $M'(\theta, \widetilde{\theta})$ replaced with
$\overline{m}'(\theta, \widetilde{\theta})$.

Thus all the results of the first section remain true under the hypotheses of
transductive classification, with $R(\theta)$ replaced with
$\overline{r}(\theta)$, $r(\theta)$ replaced with $r_1(\theta)$ and
$M'(\theta, \widetilde{\theta})$ replaced with
$\overline{m}'(\theta, \widetilde{\theta})$.

Consequently, in the case when the unlabelled shadow sample is observed,
it is possible to improve on Vapnik's bounds, to be discussed hereafter, by
using an explicit partially exchangeable posterior distribution $\pi$ and
resorting to localized or to relative bounds (in the case at least of unlimited
computing resources, which of course may still be unrealistic in many real
world situations, and with the caveat, to be recalled in the conclusion of this
article, that for small sample sizes and comparatively complex classification
models, the improvement may not be so decisive).

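Let us notice also that the transductive setting, when experimentally
available, has the advantage that

$$\overline{d}(\theta, \theta') = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N}\mathbb{1}[f_\theta(X_i) \ne f_{\theta'}(X_i)] \ge \overline{m}'(\theta, \theta') \ge \overline{r}(\theta) - \overline{r}(\theta'), \quad \theta, \theta' \in \Theta,$$

is observable in this context, providing an empirical upper bound for the
difference $\overline{r}(\widehat{\theta}) - \rho(\overline{r})$ for any non
randomized estimator $\widehat{\theta}$ and any posterior distribution $\rho$,
namely

$$\overline{r}(\widehat{\theta}) \le \rho(\overline{r}) + \rho[\overline{d}(\cdot, \widehat{\theta})].$$

Thus in the setting of transductive statistical experiments, the PAC-Bayesian
framework provides fully empirical bounds for the error rate of non randomized
estimators $\widehat{\theta} : \Omega \to \Theta$, even when using a non atomic
prior $\pi$ (or more generally a non atomic partially exchangeable posterior
distribution $\pi$), even when $\Theta$ is not a vector space and
$\theta \mapsto R(\theta)$ cannot be proved to be convex on the support of some
useful posterior distribution $\rho$.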
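The derandomization inequality rests on a pointwise triangle inequality, and the disagreement rate needs no labels. The following minimal sketch illustrates it on random binary predictions; the five classifiers, the weights of the posterior and the weighted majority vote used as non randomized estimator are all hypothetical choices made for illustration.

```python
import random

def error_rate(pred, labels):
    """Fraction of points of the extended sample where the prediction is wrong."""
    return sum(p != y for p, y in zip(pred, labels)) / len(labels)

def disagreement(pred_a, pred_b):
    """Label-free fraction of points where two classification rules disagree."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)

random.seed(1)
n_total = 40  # plays the role of (k+1)N
labels = [random.randint(0, 1) for _ in range(n_total)]
classifiers = [[random.randint(0, 1) for _ in range(n_total)] for _ in range(5)]
rho = [0.1, 0.2, 0.3, 0.25, 0.15]  # posterior weights on the five rules

# A non randomized rule: weighted majority vote under rho (one possible choice).
theta_hat = [
    int(sum(w * c[i] for w, c in zip(rho, classifiers)) > 0.5)
    for i in range(n_total)
]

lhs = error_rate(theta_hat, labels)
rhs = sum(
    w * (error_rate(c, labels) + disagreement(c, theta_hat))
    for w, c in zip(rho, classifiers)
)
```

Only `error_rate` uses the labels; in a transductive experiment it would be bounded through the PAC-Bayesian inequalities, whereas the disagreement term is computed exactly from the unlabelled extended sample.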

2.2. Vapnik's bounds for transductive classification. In this sec- 
tion, we are going to stick to plain unlocalized non relative bounds. As we 
have already mentioned, (and as it was put forward by Vapnik himself in 
his seminal works), these bounds are not always superseded by the asymp- 
totically better ones, and deserve all our efforts since they deal in many 
situations better with small samples. 

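2.2.1. With a shadow sample of arbitrary size. The great thing with the
transductive setting is that we are manipulating only $r_1$ and $\overline{r}$,
which can take but a finite number of values and therefore are piecewise
constant on $\Theta$. To make use of this, let us consider for any value
$\theta \in \Theta$ of the parameter the subset
$\Delta(\theta) \subset \Theta$ of parameters $\theta'$ such that the
classification rule $f_{\theta'}$ answers the same on the extended sample
$(X_i)_{i=1}^{(k+1)N}$ as $f_\theta$. Namely, let us put for any
$\theta \in \Theta$

$$\Delta(\theta) = \{\theta' \in \Theta : f_{\theta'}(X_i) = f_\theta(X_i),\ i = 1, \dots, (k+1)N\}.$$
We see immediately that $\Delta(\theta)$ is an exchangeable parameter subset on
which $r_1$ and $r_2$ (and therefore also $\overline{r}$) take a constant
value. Thus for any $\theta \in \Theta$ we may consider the posterior
$\rho_\theta$ defined by

$$\frac{d\rho_\theta}{d\pi}(\theta') = \mathbb{1}[\theta' \in \Delta(\theta)]\,\pi[\Delta(\theta)]^{-1},$$

and use the fact that $\rho_\theta(r_1) = r_1(\theta)$ and
$\rho_\theta(\overline{r}) = \overline{r}(\theta)$, to prove that

Lemma 2.6. For any partially exchangeable positive real measurable function
$\lambda : \Omega \times \Theta \to \mathbb{R}_+$ such that

$$\lambda(\omega, \theta') = \lambda(\omega, \theta), \quad \theta \in \Theta,\ \theta' \in \Delta(\theta),\ \omega \in \Omega, \qquad (2.1)$$

and any partially exchangeable posterior distribution
$\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at
least $1 - \epsilon$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\epsilon\pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta).$$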

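We can then remark that for any value of $\lambda$ independent of $\omega$,
the left-hand side of the previous inequality is a partially exchangeable
function of $\omega \in \Omega$. Thus this left-hand side is maximized by some
partially exchangeable function $\lambda$, namely

$$\arg\max_{\lambda}\ \Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\epsilon\pi[\Delta(\theta)]\}}{\lambda}$$

is partially exchangeable as depending only on partially exchangeable
quantities. Moreover this choice of $\lambda(\omega, \theta)$ satisfies also
condition (2.1) stated in the previous lemma of being constant on
$\Delta(\theta)$, proving

Lemma 2.7. For any partially exchangeable posterior distribution
$\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at
least $1 - \epsilon$, for any $\theta \in \Theta$ and any
$\lambda \in \mathbb{R}_+$,

$$\Phi_{\frac{\lambda}{N}}[\overline{r}(\theta)] + \frac{\log\{\epsilon\pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta).$$

Writing $\overline{r} = \frac{r_1 + k r_2}{k+1}$ and rearranging terms we obtain

Theorem 2.8. For any partially exchangeable posterior distribution
$\pi : \Omega \to \mathcal{M}_+^1(\Theta)$, with $\mathbb{P}$ probability at
least $1 - \epsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+}\frac{1 - \exp\Bigl(-\frac{\lambda}{N}r_1(\theta) + \frac{\log\{\epsilon\pi[\Delta(\theta)]\}}{N}\Bigr)}{1 - \exp(-\frac{\lambda}{N})} - \frac{r_1(\theta)}{k}.$$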
Let us remind the reader that in the case when we have a set of binary
classification rules $\{f_\theta;\ \theta \in \Theta\}$ whose VC dimension is
not greater than $h$, then we can choose $\pi$ such that
$\pi[\Delta(\theta)]$ is independent of $\theta$ and not less than
$\bigl(\frac{h}{e(k+1)N}\bigr)^{h}$.

Another important case when the complexity term
$-\log\{\pi[\Delta(\theta)]\}$ can easily be controlled is the setting of
compression schemes, introduced by Littlestone and Warmuth [21]. In this case,
we are given for each labelled subsample $(X_i, Y_i)_{i \in J}$,
$J \subset \{1, \dots, N\}$, an estimator of the parameter

$$\widehat{\theta}[(X_i, Y_i)_{i \in J}] = \widehat{\theta}_J, \quad J \subset \{1, \dots, N\},\ |J| \le h,$$

where

$$\widehat{\theta} : \bigsqcup_{k=1}^{N}(\mathcal{X} \times \mathcal{Y})^{k} \to \Theta$$

is an exchangeable function providing estimators for subsamples of arbitrary
size. Let us assume that $\widehat{\theta}$ is exchangeable, meaning that for
any $k = 1, \dots, N$ and any permutation $\sigma$ of $\{1, \dots, k\}$

$$\widehat{\theta}[(x_i, y_i)_{i=1}^{k}] = \widehat{\theta}[(x_{\sigma(i)}, y_{\sigma(i)})_{i=1}^{k}], \quad (x_i, y_i)_{i=1}^{k} \in (\mathcal{X} \times \mathcal{Y})^{k}.$$

In this situation, we can introduce the exchangeable subset

$$\{\widehat{\theta}_J;\ J \subset \{1, \dots, (k+1)N\},\ |J| \le h\} \subset \Theta,$$

which is seen to contain at most $\bigl(\frac{e(k+1)N}{h}\bigr)^{h}$
classification rules (as will be proved later on in Theorem 3.14). Note
that we had to extend the range of $J$ to all the subsets of the extended
sample, although we will use for estimation only those of the training sample,
on which the labels are observed. Thus in this case also we can find a
partially exchangeable posterior distribution $\pi$ such that
$\pi[\Delta(\widehat{\theta}_J)] \ge \bigl(\frac{h}{e(k+1)N}\bigr)^{h}$.
We see that the size of the compression scheme plays the same role in this
complexity bound as the VC dimension for VC classes.
In these two cases of binary classification with VC dimension not greater
than $h$ and compression schemes depending on a compression set with at
most $h$ points, we get a bound of

$$r_2(\theta) \le \frac{k+1}{k}\inf_{\lambda \in \mathbb{R}_+}\frac{1 - \exp\Bigl(-\frac{\lambda}{N}r_1(\theta) - \frac{h}{N}\log\bigl(\frac{e(k+1)N}{h}\bigr) + \frac{\log(\epsilon)}{N}\Bigr)}{1 - \exp(-\frac{\lambda}{N})} - \frac{r_1(\theta)}{k}.$$



Let us make some numerical application: when $N = 1000$, $h = 10$,
$\epsilon = 0.01$, and $\inf_{\Theta} r_1 = r_1(\widehat{\theta}) = 0.2$, we
find that $r_2(\widehat{\theta}) \le 0.4093$, for $k$ between 15 and 17, and
values of $\lambda$ equal respectively to 965, 968 and 971. For $k = 1$, we
find only $r_2(\widehat{\theta}) \le 0.539$, showing the interest of allowing
$k$ to be larger than 1.
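These figures can be reproduced with a few lines of code. The sketch below assumes the complexity term is bounded by $h\log\bigl(\frac{e(k+1)N}{h}\bigr)$, as for VC classes, and replaces the infimum over $\lambda$ by a minimum over a fine grid of values of $\lambda/N$; the function name and the grid are illustrative choices.

```python
import math

def shadow_bound(N, k, h, eps, r1, grid=20000, a_max=5.0):
    """Upper bound on r_2 for a shadow sample of size k*N, minimized over a grid
    of values a = lambda / N in (0, a_max]. Returns (bound, best lambda)."""
    # Assumed complexity bound: -log(eps * pi[Delta]) <= h log(e(k+1)N/h) - log(eps).
    complexity = h * math.log(math.e * (k + 1) * N / h) - math.log(eps)
    best_r_bar, best_lam = float("inf"), None
    for step in range(1, grid + 1):
        a = a_max * step / grid
        r_bar = (1.0 - math.exp(-a * r1 - complexity / N)) / (1.0 - math.exp(-a))
        if r_bar < best_r_bar:
            best_r_bar, best_lam = r_bar, a * N
    # Convert the bound on the global error rate into a bound on r_2.
    return (k + 1) / k * best_r_bar - r1 / k, best_lam

bound_k16, lam_k16 = shadow_bound(N=1000, k=16, h=10, eps=0.01, r1=0.2)
bound_k1, lam_k1 = shadow_bound(N=1000, k=1, h=10, eps=0.01, r1=0.2)
```

With $k = 16$ the minimum is reached near $\lambda = 968$ and the bound is close to 0.4093, whereas $k = 1$ gives a value close to 0.539, in agreement with the figures quoted above.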

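2.2.2. When the shadow sample has the same size as the training sample.
In the case when $k = 1$, we can improve Theorem 2.2 by taking advantage of the fact that $T_i(\sigma_i)$ can take only 3 values, namely 0, 0.5 and 1. We see thus that $T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)]$ can take only two values, 0 and $\frac12 - \Phi_{\frac{\lambda}{N}}(\frac12)$, because $\Phi_{\frac{\lambda}{N}}(0) = 0$ and $\Phi_{\frac{\lambda}{N}}(1) = 1$. Thus

$$T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)] = \bigl[\tfrac12 - \Phi_{\frac{\lambda}{N}}(\tfrac12)\bigr]\bigl[1 - |1 - 2T_i(\sigma_i)|\bigr].$$

This shows that in the case when $k = 1$,

$$\begin{aligned}
\log\bigl\{T[\exp(-\lambda r_1)]\bigr\} &= -\lambda \bar r + \frac{\lambda}{N}\sum_{i=1}^{N} \bigl\{T_i(\sigma_i) - \Phi_{\frac{\lambda}{N}}[T_i(\sigma_i)]\bigr\} \\
&= -\lambda \bar r + \frac{\lambda}{N}\sum_{i=1}^{N} \bigl[1 - |1 - 2T_i(\sigma_i)|\bigr]\bigl[\tfrac12 - \Phi_{\frac{\lambda}{N}}(\tfrac12)\bigr] \\
&\le -\lambda \bar r + \lambda\bigl[\tfrac12 - \Phi_{\frac{\lambda}{N}}(\tfrac12)\bigr]\bigl[1 - |1 - 2\bar r|\bigr].
\end{aligned}$$

Noticing that $\frac12 - \Phi_{\frac{\lambda}{N}}(\frac12) = \frac{N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]$, we obtain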

Theorem 2.9. For any partially exchangeable function $\lambda : \Omega \times \Theta \to \mathbb R_+$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$,

$$\mathbb P\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal M_+^1(\Theta)} \rho\Bigl[\lambda(\bar r - r_1) - N \log[\cosh(\tfrac{\lambda}{2N})]\,(1 - |1 - 2\bar r|)\Bigr] - \mathcal K(\rho, \pi)\Bigr]\Bigr\} \le 1.$$

As a consequence, reasoning as previously, we deduce

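Theorem 2.10. In the case when $k = 1$, for any partially exchangeable posterior distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$ and any $\lambda \in \mathbb R_+$,

$$\bar r(\theta) - \frac{N}{\lambda}\log[\cosh(\tfrac{\lambda}{2N})]\,\bigl(1 - |1 - 2\bar r(\theta)|\bigr) + \frac{\log\{\varepsilon \pi[\Delta(\theta)]\}}{\lambda} \le r_1(\theta),$$

and consequently for any $\theta \in \Theta$,

$$r_2(\theta) \le 2 \inf_{\lambda \in \mathbb R_+} \frac{r_1(\theta) - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}}{1 - \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]} - r_1(\theta).$$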

In the case of binary classification using a VC class of VC dimension not greater than $h$, we can choose $\pi$ such that $-\log\{\pi[\Delta(\theta)]\} \le h \log(\frac{2eN}{h})$ and obtain the following numerical illustration of this theorem: for $N = 1000$, $h = 10$, $\varepsilon = 0.01$ and $\inf_\Theta r_1 = r_1(\theta) = 0.2$, we find an upper bound $r_2(\theta) \le 0.5033$, which improves on Theorem 2.8 but still is not under the significance level $\frac12$ (achieved by blind random classification). This indicates that, in some noisy situations, considering shadow samples of arbitrary sizes brings a significant improvement over bounds obtained with a shadow sample of the same size as the training sample.
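The value quoted for Theorem 2.10 can be checked numerically. The sketch below assumes the bound has the form $2 \inf_\lambda [r_1 + d/\lambda] / [1 - \frac{2N}{\lambda}\log\cosh(\frac{\lambda}{2N})] - r_1$ with $d = h\log(2eN/h) - \log\varepsilon$; the function name and the grid over $\lambda$ are illustrative.

```python
import math

def equal_size_bound(r1, n, d, lam):
    """Bound on r2 for k = 1, for one value of lambda; d = -log(eps * pi[Delta(theta)])."""
    a = (2 * n / lam) * math.log(math.cosh(lam / (2 * n)))
    return 2 * (r1 + d / lam) / (1 - a) - r1

n, h, eps, r1 = 1000, 10, 0.01, 0.2
d = h * math.log(2 * math.e * n / h) - math.log(eps)  # VC-type complexity with 2N points
bound_2_10 = min(equal_size_bound(r1, n, d, lam) for lam in range(1, 5001))
print(f"r2 <= {bound_2_10:.4f}")
```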

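2.2.3. When moreover the distribution of the augmented sample is exchangeable. In the case when $k = 1$ and $\mathbb P$ is exchangeable, meaning that for any bounded measurable function $h : \Omega \to \mathbb R$ and any permutation $s \in \mathfrak S(\{1, \dots, 2N\})$, $\mathbb P[h(\omega \circ s)] = \mathbb P[h(\omega)]$, we can still improve the bound as follows. Let

$$T'(h) = \frac{1}{N!} \sum_{s \in \mathfrak S(\{N+1, \dots, 2N\})} h \circ s.$$

Then we can write

$$1 - |1 - 2T_i(\sigma_i)| = (\sigma_i - \sigma_{i+N})^2 = \sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N}.$$

Using this identity, we get for any exchangeable function $\lambda : \Omega \times \Theta \to \mathbb R_+$,

$$T\Bigl\{\exp\Bigl[\lambda(\bar r - r_1) - \log[\cosh(\tfrac{\lambda}{2N})] \sum_{i=1}^{N} (\sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N})\Bigr]\Bigr\} \le 1.$$

Let us put

$$A(\lambda) = \frac{2N}{\lambda}\log[\cosh(\tfrac{\lambda}{2N})], \qquad (2.2)$$

$$v(\theta) = \frac{1}{2N}\sum_{i=1}^{N} (\sigma_i + \sigma_{i+N} - 2\sigma_i\sigma_{i+N}). \qquad (2.3)$$

With these notations

$$T\bigl\{\exp\{\lambda[\bar r - r_1 - A(\lambda)v]\}\bigr\} \le 1.$$

Let us notice now that

$$T'[v(\theta)] = \bar r(\theta) - r_1(\theta) r_2(\theta).$$

Let $\pi : \Omega \to \mathcal M_+^1(\Theta)$ be any given exchangeable posterior distribution. Using the exchangeability of $\mathbb P$ and $\pi$ and the convexity of the exponential function, we get

$$\begin{aligned}
\mathbb P\bigl\{\pi \exp\{\lambda[\bar r - r_1 - A(\bar r - r_1 r_2)]\}\bigr\} &= \mathbb P\bigl\{\pi \exp\{\lambda[\bar r - r_1 - A\,T'(v)]\}\bigr\} \\
&\le \mathbb P\bigl\{\pi\, T' \exp\{\lambda[\bar r - r_1 - Av]\}\bigr\} = \mathbb P\bigl\{T' \pi \exp\{\lambda[\bar r - r_1 - Av]\}\bigr\} \\
&= \mathbb P\bigl\{\pi \exp\{\lambda[\bar r - r_1 - Av]\}\bigr\} = \mathbb P\bigl\{T \pi \exp\{\lambda[\bar r - r_1 - Av]\}\bigr\} \\
&= \mathbb P\bigl\{\pi\, T \exp\{\lambda[\bar r - r_1 - Av]\}\bigr\} \le 1.
\end{aligned}$$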

We are thus ready to state 

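Theorem 2.11. In the case when $k = 1$, for any exchangeable probability distribution $\mathbb P$, for any exchangeable posterior distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$, for any exchangeable function $\lambda : \Omega \times \Theta \to \mathbb R_+$,

$$\mathbb P\Bigl\{\exp\Bigl[\sup_{\rho \in \mathcal M_+^1(\Theta)} \rho\bigl\{\lambda[\bar r - r_1 - A(\lambda)(\bar r - r_1 r_2)]\bigr\} - \mathcal K(\rho, \pi)\Bigr]\Bigr\} \le 1,$$

where $A(\lambda)$ is defined by equation (2.2) above.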

We then deduce as previously 

Corollary 2.12. For any exchangeable posterior distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$, for any exchangeable probability measure $\mathbb P \in \mathcal M_+^1(\Omega)$, for any measurable exchangeable function $\lambda : \Omega \times \Theta \to \mathbb R_+$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\bar r(\theta) \le r_1(\theta) + A(\lambda)\bigl[\bar r(\theta) - r_1(\theta) r_2(\theta)\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda},$$

where $A(\lambda)$ is defined by equation (2.2) on page 108.

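In order to deduce an empirical bound from this theorem, we have to make some choice for $\lambda(\omega, \theta)$. Fortunately, it is easy to show that the bound indeed holds uniformly in $\lambda$. This is the case because the inequality can be rewritten as a function of only one non exchangeable quantity, namely $r_1(\theta)$. Indeed, since $r_2 = 2\bar r - r_1$, we see that the inequality can be written as

$$\bar r(\theta) \le r_1(\theta) + A(\lambda)\bigl[\bar r(\theta) - 2\bar r(\theta) r_1(\theta) + r_1(\theta)^2\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}.$$

It can be solved in $r_1(\theta)$, to get

$$r_1(\theta) \ge f\bigl(\lambda, \bar r(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr),$$

where namely

$$f(\lambda, \bar r, d) = [2A(\lambda)]^{-1}\Bigl\{2\bar r A(\lambda) - 1 + \sqrt{[1 - 2\bar r A(\lambda)]^2 + 4A(\lambda)\bigl\{\bar r[1 - A(\lambda)] - \tfrac{d}{\lambda}\bigr\}}\Bigr\}.$$

Thus we can find some exchangeable function $\lambda(\omega, \theta)$, such that

$$f\bigl(\lambda(\omega, \theta), \bar r(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr) = \sup_{\beta \in \mathbb R_+} f\bigl(\beta, \bar r(\theta), -\log\{\varepsilon\pi[\Delta(\theta)]\}\bigr).$$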

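Applying Corollary 2.12 to that choice of $\lambda$, we see that

Theorem 2.13. For any exchangeable probability measure $\mathbb P \in \mathcal M_+^1(\Omega)$, for any exchangeable posterior probability distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb R_+$,

$$\bar r(\theta) \le r_1(\theta) + A(\lambda)\bigl[\bar r(\theta) - r_1(\theta) r_2(\theta)\bigr] - \frac{\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda},$$

where $A(\lambda)$ is defined by equation (2.2) on page 108.
Solving the previous inequality in $r_2(\theta)$, we get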

Corollary 2.14. Under the same assumptions as in the previous theorem, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$r_2(\theta) \le \inf_{\lambda \in \mathbb R_+} \frac{r_1(\theta)\bigl\{1 + \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\bigr\} - \frac{2\log\{\varepsilon\pi[\Delta(\theta)]\}}{\lambda}}{1 - \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\,[1 - 2r_1(\theta)]}.$$

Applying this to our usual numerical example of a binary classification model with VC dimension not greater than $h = 10$, when $N = 1000$, $\inf_\Theta r_1 = r_1(\theta) = 0.2$ and $\varepsilon = 0.01$, we obtain that $r_2(\theta) \le 0.4450$.
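The figure quoted for Corollary 2.14 can be reproduced as follows. This is a sketch under stated assumptions: $r_1 = 0.2$ as in the paper's other numerical examples, complexity $d = h\log(2eN/h) - \log\varepsilon$, and an illustrative integer grid for $\lambda$.

```python
import math

def exchangeable_bound(r1, n, d, lam):
    """Corollary 2.14 for one lambda, with A(lambda) = (2N/lambda) log cosh(lambda/(2N))."""
    a = (2 * n / lam) * math.log(math.cosh(lam / (2 * n)))
    return (r1 * (1 + a) + 2 * d / lam) / (1 - a * (1 - 2 * r1))

n, h, eps, r1 = 1000, 10, 0.01, 0.2
d = h * math.log(2 * math.e * n / h) - math.log(eps)
bound_2_14 = min(exchangeable_bound(r1, n, d, lam) for lam in range(1, 5001))
print(f"r2 <= {bound_2_14:.4f}")
```

Compared with Theorem 2.10, the only change is the factor $[1 - 2r_1(\theta)]$ in the denominator, which is what exchangeability buys.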

2.3. Vapnik's bounds for inductive classification. 

2.3.1. Arbitrary shadow sample size. We assume in this section that

$$\mathbb P = \Bigl(\bigotimes_{i=1}^{N} P_i\Bigr)^{\otimes \infty} \in \mathcal M_+^1\bigl\{[(\mathcal X \times \mathcal Y)^N]^{\mathbb N}\bigr\},$$

where $P_i \in \mathcal M_+^1(\mathcal X \times \mathcal Y)$: we consider an infinite i.i.d. sequence of independent, not identically distributed samples of size $N$, only the first of which is observed. The shadow samples will only appear in the proofs. The aim of this section is to prove improved Vapnik bounds, generalizing them at the same time to the independent non i.i.d. setting, which to our knowledge had not been done before.

Let us introduce the notation $\mathbb P'[h(\omega)] = \mathbb P[h(\omega) \mid (X_i, Y_i)_{i=1}^N]$, where $h$ may be any suitable (e.g. bounded) random variable, and let us also put $\Omega = [(\mathcal X \times \mathcal Y)^N]^{\mathbb N}$.

Definition 2.2. For any subset $A \subset \mathbb N$ of integers, let $\mathfrak C(A)$ be the set of circular permutations of the totally ordered set $A$, extended to a permutation of $\mathbb N$ by taking it to be the identity on the complement $\mathbb N \setminus A$ of $A$. We will say that a random function $h : \Omega \to \mathbb R$ is $k$-partially exchangeable if

$$h(\omega \circ s) = h(\omega), \qquad s \in \mathfrak C(\{i + jN ;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$

In the same way, we will say that a posterior distribution $\pi : \Omega \to \mathcal M_+^1(\Theta)$ is $k$-partially exchangeable if

$$\pi(\omega \circ s) = \pi(\omega) \in \mathcal M_+^1(\Theta), \qquad s \in \mathfrak C(\{i + jN ;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$

Note that $\mathbb P$ itself is $k$-partially exchangeable for any $k$ in the sense that for any bounded measurable function $h : \Omega \to \mathbb R$

$$\mathbb P[h(\omega \circ s)] = \mathbb P[h(\omega)], \qquad s \in \mathfrak C(\{i + jN ;\ j = 0, \dots, k\}),\ i = 1, \dots, N.$$

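Let $\Delta_k(\theta) = \bigl\{\theta' \in \Theta ;\ [f_{\theta'}(X_i)]_{i=1}^{(k+1)N} = [f_\theta(X_i)]_{i=1}^{(k+1)N}\bigr\}$, $\theta \in \Theta$, $k \in \mathbb N^*$,

and let also $\bar r_k(\theta) = \frac{1}{(k+1)N}\sum_{i=1}^{(k+1)N} \mathbb 1[f_\theta(X_i) \ne Y_i]$. Theorem 2.2 shows that for any positive real parameter $\lambda$ and any $k$-partially exchangeable posterior distribution $\pi_k : \Omega \to \mathcal M_+^1(\Theta)$,

$$\mathbb P\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta} \lambda\bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1\bigr] + \log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr\} \le \varepsilon.$$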



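Using the general fact that

$$\mathbb P[\exp(h)] = \mathbb P\bigl\{\mathbb P'[\exp(h)]\bigr\} \ge \mathbb P\bigl\{\exp[\mathbb P'(h)]\bigr\},$$

and the fact that the expectation of a supremum is larger than the supremum of an expectation, we see that with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\mathbb P'\bigl\{\Phi_{\frac{\lambda}{N}}[\bar r_k(\theta)]\bigr\} \le r_1(\theta) - \frac{\mathbb P'\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\}}{\lambda}.$$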

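Let us put for short

$$\begin{aligned}
\bar d_k(\theta) &= -\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}, \\
d'_k(\theta) &= -\mathbb P'\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\}, \\
d_k(\theta) &= -\mathbb P\bigl\{\log\{\varepsilon\pi_k[\Delta_k(\theta)]\}\bigr\}.
\end{aligned}$$

We can use the convexity of $\Phi_{\frac{\lambda}{N}}$ and the fact that $\mathbb P'(\bar r_k) = \frac{r_1 + kR}{k+1}$, to see that

$$\Phi_{\frac{\lambda}{N}}\Bigl[\frac{r_1(\theta) + kR(\theta)}{k+1}\Bigr] \le \mathbb P'\bigl\{\Phi_{\frac{\lambda}{N}}[\bar r_k(\theta)]\bigr\}.$$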

We have proved 



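Theorem 2.15. Using the above hypotheses and notations, for any sequence $\pi_k : \Omega \to \mathcal M_+^1(\Theta)$, where $\pi_k$ is a $k$-partially exchangeable posterior distribution, for any positive real constant $\lambda$, any positive integer $k$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}\Bigl[\frac{r_1(\theta) + kR(\theta)}{k+1}\Bigr] \le r_1(\theta) + \frac{d'_k(\theta)}{\lambda}.$$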



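We can make, as we did in Section 1, the result of this theorem uniform in $\lambda \in \{\alpha^j ;\ j \in \mathbb N^*\}$ and $k \in \mathbb N^*$ (considering on $k$ the prior $\frac{1}{k(k+1)}$ and on $j$ the prior $\frac{1}{j(j+1)}$), and obtain

Theorem 2.16. For any real parameter $\alpha > 1$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb N^*,\, j \in \mathbb N^*} \frac{1 - \exp\Bigl\{-\frac{\alpha^j}{N} r_1(\theta) - \frac{1}{N}\bigl\{d'_k(\theta) + \log[k(k+1)j(j+1)]\bigr\}\Bigr\}}{\frac{k}{k+1}\Bigl[1 - \exp\bigl(-\frac{\alpha^j}{N}\bigr)\Bigr]} - \frac{r_1(\theta)}{k}.$$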

Note that as a special case we can choose $\pi_k$ such that $\log\{\pi_k[\Delta_k(\theta)]\}$ is independent of $\theta$ and equal to $-\log(\mathfrak N_k)$, where $\mathfrak N_k = \bigl|\bigl\{[f_\theta(X_i)]_{i=1}^{(k+1)N} ;\ \theta \in \Theta\bigr\}\bigr|$ is the size of the trace of the classification model on the extended sample of size $(k+1)N$. With this choice, we obtain a bound involving a new flavour of conditional Vapnik's entropy, namely

$$d'_k(\theta) = \mathbb P\bigl[\log(\mathfrak N_k) \mid (Z_i)_{i=1}^N\bigr] - \log(\varepsilon).$$



In the case of binary classification using a VC class of VC dimension not greater than $h = 10$, when $N = 1000$, $\inf_\Theta r_1 = r_1(\theta) = 0.2$ and $\varepsilon = 0.01$, choosing $\alpha = 1.1$, we obtain $R(\theta) \le 0.4271$ (for an optimal value of $\lambda = 1071.8$, and an optimal value of $k = 16$).

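2.3.2. A better minimization with respect to the exponential parameter. If we are not satisfied with optimizing $\lambda$ on a discrete subset of the real line, we can use a slightly different approach. From Theorem 2.2 we see that for any positive integer $k$, for any $k$-partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb R_+$ satisfying equation (2.1) (with $\Delta(\theta)$ replaced with $\Delta_k(\theta)$), for any $\varepsilon \in (0, 1)$ and $\eta \in (0, 1)$,

$$\mathbb P\Bigl\{\mathbb P'\Bigl[\exp\Bigl[\sup_{\theta \in \Theta} \lambda\bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1\bigr] + \log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr]\Bigr\} \le \varepsilon\eta,$$

therefore with $\mathbb P$ probability at least $1 - \varepsilon$,

$$\mathbb P'\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta} \lambda\bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1\bigr] + \log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}\Bigr]\Bigr\} \le \eta,$$

and consequently, with $\mathbb P$ probability at least $1 - \varepsilon$, with $\mathbb P'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,

$$\Phi_{\frac{\lambda}{N}}(\bar r_k) + \frac{\log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}}{\lambda} \le r_1.$$

Now we are entitled to choose

$$\lambda(\omega, \theta) \in \arg\max_{\lambda' \in \mathbb R_+} \Phi_{\frac{\lambda'}{N}}(\bar r_k) + \frac{\log\{\varepsilon\eta\pi_k[\Delta_k(\theta)]\}}{\lambda'}.$$

This shows that with $\mathbb P$ probability at least $1 - \varepsilon$, with $\mathbb P'$ probability at least $1 - \eta$, for any $\theta \in \Theta$,

$$\sup_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}(\bar r_k) - \frac{\bar d_k(\theta) - \log(\eta)}{\lambda} \le r_1,$$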

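which can also be written

$$\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k(\theta)}{\lambda} \le -\frac{\log(\eta)}{\lambda}, \qquad \lambda \in \mathbb R_+.$$

Thus with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, any $\lambda \in \mathbb R_+$,

$$\mathbb P'\Bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k}{\lambda}\Bigr] \le -\frac{\log(\eta)}{\lambda} + \Bigl[1 - r_1 + \frac{\log(\eta)}{\lambda}\Bigr]\eta.$$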



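On the other hand, $\Phi_{\frac{\lambda}{N}}$ being a convex function,

$$\mathbb P'\Bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k}{\lambda}\Bigr] \ge \Phi_{\frac{\lambda}{N}}\bigl[\mathbb P'(\bar r_k)\bigr] - r_1 - \frac{d'_k}{\lambda} = \Phi_{\frac{\lambda}{N}}\Bigl(\frac{kR + r_1}{k+1}\Bigr) - r_1 - \frac{d'_k}{\lambda}.$$

Thus with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\frac{kR + r_1}{k+1} \le \inf_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl[r_1(1 - \eta) + \eta + \frac{d'_k - \log(\eta)(1 - \eta)}{\lambda}\Bigr].$$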



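We can generalize this approach by considering a finite decreasing sequence $\eta_0 = 1 > \eta_1 > \eta_2 > \dots > \eta_J > \eta_{J+1} = 0$, and the corresponding sequence of levels

$$L_j = -\frac{\log(\eta_j)}{\lambda}, \quad 0 \le j \le J, \qquad L_{J+1} = 1 - r_1 - \frac{\log(J) - \log(\varepsilon)}{\lambda}.$$

Taking a union bound in $j$, we see that with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb R_+$,

$$\mathbb P'\Bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k + \log(J)}{\lambda} \ge L_j\Bigr] \le \eta_j, \qquad j = 0, \dots, J+1,$$

and consequently

$$\begin{aligned}
\mathbb P'\Bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k + \log(J)}{\lambda}\Bigr]
&\le \int_0^{L_{J+1}} \mathbb P'\Bigl[\Phi_{\frac{\lambda}{N}}(\bar r_k) - r_1 - \frac{\bar d_k + \log(J)}{\lambda} \ge \alpha\Bigr]\, d\alpha
\le \sum_{j=1}^{J+1} \eta_{j-1}(L_j - L_{j-1}) \\
&= \eta_J\Bigl[1 - r_1 - \frac{\log(J) - \log(\varepsilon) - \log(\eta_J)}{\lambda}\Bigr] - \frac{\log(\eta_1)}{\lambda} + \sum_{j=1}^{J-1} \frac{\eta_j}{\lambda}\log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr).
\end{aligned}$$

Let us put

$$d''_k\bigl[\theta, (\eta_j)_{j=1}^J\bigr] = d'_k(\theta) + \log(J) - \log(\eta_1) + \sum_{j=1}^{J-1} \eta_j \log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr) + \eta_J \log\Bigl(\frac{\varepsilon\eta_J}{J}\Bigr).$$

We have proved that for any decreasing sequence $(\eta_j)_{j=1}^J$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$\frac{kR + r_1}{k+1} \le \inf_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl[r_1(1 - \eta_J) + \eta_J + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^J\bigr]}{\lambda}\Bigr].$$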



Remark 2.1. We can for instance choose $J = 2$, $\eta_2 = \frac{1}{10N}$, $\eta_1 = \frac{1}{\log(10N)}$, resulting in

$$d''_k = d'_k + \log(2) + \log\log(10N) + 1 - \frac{\log\log(10N)}{\log(10N)} - \frac{\log(20N/\varepsilon)}{10N}.$$

In the case when $N = 1000$ and for any $\varepsilon \in (0, 1)$, we get $d''_k \le d'_k + 3.7$, in the case when $N = 10^6$, we get $d''_k \le d'_k + 4.4$, and in the case $N = 10^9$, we get $d''_k \le d'_k + 4.7$.

Therefore, for any practical purpose we could take $d''_k = d'_k + 4.7$ and $\eta_J = \frac{1}{10N}$ in the above inequality.
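The constants 3.7, 4.4 and 4.7 of Remark 2.1 are easy to check. The sketch below evaluates $\log(2) + \log\log(10N) + 1 - \log\log(10N)/\log(10N)$ and drops the last, negative, $\varepsilon$-dependent term, so it returns an upper bound on $d''_k - d'_k$ valid for any $\varepsilon \in (0,1)$; the function name is illustrative.

```python
import math

def complexity_offset(n):
    """Upper bound on d''_k - d'_k for J = 2, eta_2 = 1/(10 n), eta_1 = 1/log(10 n)."""
    log_10n = math.log(10 * n)
    return math.log(2) + math.log(log_10n) + 1 - math.log(log_10n) / log_10n

offsets = {n: complexity_offset(n) for n in (10**3, 10**6, 10**9)}
for n, value in offsets.items():
    print(f"N = {n:>10}: d'' - d' <= {value:.3f}")
```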

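Taking moreover a weighted union bound in $k$, we get

Theorem 2.17. For any $\varepsilon \in (0, 1)$, any sequence $1 > \eta_1 > \dots > \eta_J > 0$, any sequence $\pi_k : \Omega \to \mathcal M_+^1(\Theta)$, where $\pi_k$ is a $k$-partially exchangeable posterior distribution, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb N^*} \frac{k+1}{k} \inf_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl\{r_1(\theta) + \eta_J[1 - r_1(\theta)] + \frac{d''_k\bigl[\theta, (\eta_j)_{j=1}^J\bigr] + \log[k(k+1)]}{\lambda}\Bigr\} - \frac{r_1(\theta)}{k}.$$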



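Corollary 2.18. For any $\varepsilon \in (0, 1)$, for any $N \le 10^9$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb N^*} \inf_{\lambda \in \mathbb R_+} \frac{k+1}{k}\Bigl[1 - \exp\bigl(-\tfrac{\lambda}{N}\bigr)\Bigr]^{-1}\Bigl\{1 - \exp\Bigl[-\frac{\lambda}{N}\Bigl[r_1(\theta) + \frac{1}{10N}\Bigr] - \frac{1}{N}\Bigl(\mathbb P'\bigl[\log(\mathfrak N_k) \mid (Z_i)_{i=1}^N\bigr] - \log(\varepsilon) + \log[k(k+1)] + 4.7\Bigr)\Bigr]\Bigr\} - \frac{r_1(\theta)}{k}.$$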



Let us end this section with a numerical example: in the case of binary classification with a VC class of dimension not greater than 10, when $N = 1000$, $\inf_\Theta r_1 = r_1(\theta) = 0.2$ and $\varepsilon = 0.01$, we get a bound $R(\theta) \le 0.4211$ (for optimal values of $k = 15$ and of $\lambda = 1010$).
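This numerical example can be reproduced along the following lines. The sketch makes several assumptions that should be read as such: $\Phi_a^{-1}(q) = [1 - \exp(-aq)]/[1 - \exp(-a)]$, the VC-type bound $h\log(e(k+1)N/h)$ on the conditional entropy term, $\eta_J = 1/(10N)$, and the offset 3.7 valid for $N = 1000$ (rather than the general 4.7). Grids and names are illustrative.

```python
import math

def phi_inverse(a, q):
    """Assumed form of the inverse of Phi_a: (1 - exp(-a q)) / (1 - exp(-a))."""
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

def inductive_bound(r1, n, h, eps, k, lam, eta_last, offset):
    """Bound on R(theta) in the form of Theorem 2.17 for fixed k and lambda."""
    d2 = h * math.log(math.e * (k + 1) * n / h) - math.log(eps) + offset
    q = r1 + eta_last * (1.0 - r1) + (d2 + math.log(k * (k + 1))) / lam
    return (k + 1) / k * phi_inverse(lam / n, q) - r1 / k

n, h, eps, r1 = 1000, 10, 0.01, 0.2
best_bound, best_k, best_lam = min(
    (inductive_bound(r1, n, h, eps, k, lam, 1 / (10 * n), 3.7), k, lam)
    for k in range(1, 41)            # shadow sample sizes k * N
    for lam in range(100, 3001, 5)   # hypothetical grid for lambda
)
print(f"R <= {best_bound:.4f} for k = {best_k}, lambda = {best_lam}")
```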

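2.3.3. Equal shadow and training sample sizes. In the case when $k = 1$, we can use Theorem 2.10 and replace $\Phi_{\frac{\lambda}{N}}^{-1}(q)$ with $\bigl\{1 - \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\bigr\}^{-1} q$, resulting in

Theorem 2.19. For any $\varepsilon \in (0, 1)$, any $N \le 10^9$, any 1-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal M_+^1(\Theta)$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{\lambda \in \mathbb R_+} \frac{\bigl\{1 + \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\bigr\} r_1(\theta) + \frac{1}{5N} + \frac{2[d'_1(\theta) + 4.7]}{\lambda}}{1 - \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]}.$$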



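2.3.4. Improvement on the equal sample size bound in the i.i.d. case. Finally, in the case when $\mathbb P$ is i.i.d., meaning that all the $P_i$ are equal, we can improve the previous bound. For any partially exchangeable function $\lambda : \Omega \times \Theta \to \mathbb R_+$, we saw in the discussion preceding Theorem 2.11 that

$$T\bigl\{\exp\{\lambda[\bar r_k - r_1 - A(\lambda)v]\}\bigr\} \le 1,$$

with the notations introduced therein. Thus for any partially exchangeable positive real measurable function $\lambda : \Omega \times \Theta \to \mathbb R_+$ satisfying equation (2.1), any 1-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal M_+^1(\Theta)$,

$$\mathbb P\Bigl\{\exp\Bigl[\sup_{\theta \in \Theta} \lambda\bigl[\bar r_k(\theta) - r_1(\theta) - A(\lambda)v(\theta)\bigr] + \log\{\varepsilon\pi_1[\Delta(\theta)]\}\Bigr]\Bigr\} \le \varepsilon.$$

Therefore, arguing as in the previous section, with $\mathbb P$ probability at least $1 - \varepsilon$, with $\mathbb P'$ probability at least $1 - \eta$,

$$\bar r_k(\theta) \le r_1(\theta) + A(\lambda)v(\theta) + \frac{1}{\lambda}\bigl[\bar d_1(\theta) - \log(\eta)\bigr].$$

We can then choose $\lambda(\omega, \theta) \in \arg\min_{\lambda' \in \mathbb R_+} A(\lambda')v(\theta) + \frac{\bar d_1(\theta) - \log(\eta)}{\lambda'}$, which satisfies the required conditions, to show that with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, with $\mathbb P'$ probability at least $1 - \eta$, for any $\lambda \in \mathbb R_+$,

$$\bar r_k(\theta) \le r_1(\theta) + A(\lambda)v(\theta) + \frac{\bar d_1(\theta) - \log(\eta)}{\lambda}.$$
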
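We can then take a union bound on a decreasing sequence of $J$ values $\eta_1 \ge \dots \ge \eta_J$ of $\eta$. Weakening a little the order of quantifiers, we then obtain the following statement: with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, for any $\lambda \in \mathbb R_+$, for any $j = 1, \dots, J$,

$$\mathbb P'\Bigl[\bar r_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) - \frac{\bar d_1(\theta) + \log(J)}{\lambda} \ge -\frac{\log(\eta_j)}{\lambda}\Bigr] \le \eta_j.$$

Consequently for any $\lambda \in \mathbb R_+$,

$$\mathbb P'\Bigl[\bar r_k(\theta) - r_1(\theta) - A(\lambda)v(\theta) - \frac{\bar d_1(\theta) + \log(J)}{\lambda}\Bigr] \le \eta_J\Bigl[1 - r_1(\theta) - \frac{\log(J) - \log(\varepsilon) - \log(\eta_J)}{\lambda}\Bigr] - \frac{\log(\eta_1)}{\lambda} + \sum_{j=1}^{J-1} \frac{\eta_j}{\lambda}\log\Bigl(\frac{\eta_j}{\eta_{j+1}}\Bigr).$$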



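Moreover $\mathbb P'[v(\theta)] = \frac{r_1 + R}{2} - r_1 R$ (this is where we need equidistribution), thus proving that

$$\frac{R - r_1}{2} \le \frac{A(\lambda)}{2}\bigl[R + r_1 - 2r_1 R\bigr] + \frac{d''_1\bigl[\theta, (\eta_j)_{j=1}^J\bigr]}{\lambda} + \eta_J\bigl[1 - r_1(\theta)\bigr].$$

Keeping track of quantifiers, we obtain

Theorem 2.20. For any decreasing sequence $(\eta_j)_{j=1}^J$, any $\varepsilon \in (0, 1)$, any 1-partially exchangeable posterior distribution $\pi_1 : \Omega \to \mathcal M_+^1(\Theta)$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{\lambda \in \mathbb R_+} \frac{\bigl\{1 + \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\bigr\} r_1(\theta) + \frac{2 d''_1[\theta, (\eta_j)_{j=1}^J]}{\lambda} + 2\eta_J[1 - r_1(\theta)]}{1 - \frac{2N}{\lambda}\log[\cosh(\frac{\lambda}{2N})]\,[1 - 2r_1(\theta)]}.$$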

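2.4. Gaussian approximation in Vapnik's bounds. To obtain formulas which can be easily compared with Vapnik's original bounds, we may replace $p - \Phi_a(p)$ with a Gaussian upper bound:

Lemma 2.21. For any $p \in (0, \frac12)$, any $a \in \mathbb R_+$,

$$p - \Phi_a(p) \le \frac{a}{2}\, p(1 - p).$$

For any $p \in (\frac12, 1)$,

$$p - \Phi_a(p) \le \frac{a}{8}.$$

Proof. Let us notice that for any $p \in (0, 1)$,

$$\frac{\partial}{\partial a}\bigl[-a\Phi_a(p)\bigr] = -\frac{p\exp(-a)}{1 - p + p\exp(-a)},$$

$$\frac{\partial^2}{\partial a^2}\bigl[-a\Phi_a(p)\bigr] = \frac{p\exp(-a)}{1 - p + p\exp(-a)}\Bigl(1 - \frac{p\exp(-a)}{1 - p + p\exp(-a)}\Bigr) \le \begin{cases} p(1 - p), & p \in (0, \frac12), \\ \frac14, & p \in (\frac12, 1). \end{cases}$$

Thus taking a Taylor expansion of order one with integral remainder:

$$-a\Phi_a(p) \le \begin{cases} -ap + \int_0^a p(1 - p)(a - b)\,db = -ap + \frac{a^2}{2}p(1 - p), & p \in (0, \frac12), \\ -ap + \int_0^a \frac14 (a - b)\,db = -ap + \frac{a^2}{8}, & p \in (\frac12, 1). \end{cases}$$
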
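This ends the proof of our lemma. □

Lemma 2.22. Let us consider the bound

$$B(q, d) = \Bigl(1 + \frac{2d}{N}\Bigr)^{-1}\Bigl[q + \frac{d}{N} + \sqrt{\frac{2dq(1 - q)}{N} + \frac{d^2}{N^2}}\Bigr], \qquad q \in \mathbb R_+,\ d \in \mathbb R_+.$$

Let us also put

$$\bar B(q, d) = \begin{cases} B(q, d), & B(q, d) \le \frac12, \\ q + \sqrt{\frac{d}{2N}}, & \text{otherwise.} \end{cases}$$

For any positive real parameters $q$ and $d$

$$\inf_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}^{-1}\Bigl(q + \frac{d}{\lambda}\Bigr) \le \bar B(q, d).$$

Proof. Let $p = \inf_{\lambda \in \mathbb R_+} \Phi_{\frac{\lambda}{N}}^{-1}\bigl(q + \frac{d}{\lambda}\bigr)$. For any $\lambda \in \mathbb R_+$,

$$p - \frac{\lambda}{2N}(p \wedge \tfrac12)\bigl[1 - (p \wedge \tfrac12)\bigr] \le \Phi_{\frac{\lambda}{N}}(p) \le q + \frac{d}{\lambda}.$$

Thus

$$p \le q + \inf_{\lambda \in \mathbb R_+}\Bigl\{\frac{\lambda}{2N}(p \wedge \tfrac12)\bigl[1 - (p \wedge \tfrac12)\bigr] + \frac{d}{\lambda}\Bigr\} = q + \sqrt{\frac{2d(p \wedge \frac12)[1 - (p \wedge \frac12)]}{N}} \le q + \sqrt{\frac{d}{2N}}.$$

Then let us remark that $B(q, d) = \sup\Bigl\{p' \in \mathbb R_+ ;\ p' \le q + \sqrt{\frac{2dp'(1 - p')}{N}}\Bigr\}$.

If moreover $\frac12 \ge B(q, d)$, then according to this remark $\frac12 \ge q + \sqrt{\frac{d}{2N}} \ge p$. Therefore $p \le \frac12$, and consequently $p \le q + \sqrt{\frac{2dp(1 - p)}{N}}$, implying that $p \le B(q, d)$. □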
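Both lemmas lend themselves to a quick numerical sanity check. The sketch below assumes the definition $\Phi_a(p) = -a^{-1}\log\{1 - [1 - \exp(-a)]p\}$ (so that $\Phi_a^{-1}(q) = [1 - \exp(-aq)]/[1 - \exp(-a)]$); the grids and the test values of $(q, d)$ are arbitrary illustrative choices, and a finite grid over $\lambda$ only approximates the infimum from above.

```python
import math

def phi(a, p):
    """Assumed definition of Phi_a(p)."""
    return -math.log(1.0 - (1.0 - math.exp(-a)) * p) / a

def phi_inverse(a, q):
    return (1.0 - math.exp(-a * q)) / (1.0 - math.exp(-a))

def b_bar(q, d, n):
    """Gaussian bound of Lemma 2.22."""
    b = (q + d / n + math.sqrt(2 * d * q * (1 - q) / n + (d / n) ** 2)) / (1 + 2 * d / n)
    return b if b <= 0.5 else q + math.sqrt(d / (2 * n))

def gaussian_gap_ok(a, p, tol=1e-12):
    """Check Lemma 2.21 at one point (a, p)."""
    cap = a / 2 * p * (1 - p) if p <= 0.5 else a / 8
    return p - phi(a, p) <= cap + tol

lemma_2_21_holds = all(
    gaussian_gap_ok(a / 20, p / 100) for a in range(1, 101) for p in range(1, 100)
)

def grid_infimum(q, d, n, lam_grid):
    return min(phi_inverse(lam / n, q + d / lam) for lam in lam_grid)

n = 1000
cases = [(0.2, 50.0), (0.2001, 97.5634)]
lemma_2_22_gaps = [b_bar(q, d, n) - grid_infimum(q, d, n, range(1, 5001)) for q, d in cases]
print(lemma_2_21_holds, [round(g, 4) for g in lemma_2_22_gaps])
```

The gaps printed for Lemma 2.22 measure how much is lost by the Gaussian approximation at these parameter values.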

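2.4.1. Arbitrary shadow sample size. This lemma combined with Corollary 2.18 on page 115 implies

Corollary 2.23. For any $\varepsilon \in (0, 1)$, any integer $N \le 10^9$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le \inf_{k \in \mathbb N^*} \frac{k+1}{k}\,\bar B\Bigl[r_1(\theta) + \frac{1}{10N},\ d'_k(\theta) + \log[k(k+1)] + 4.7\Bigr] - \frac{r_1(\theta)}{k}.$$

2.4.2. Equal sample sizes in the i.i.d. case. To make a link with Vapnik's result, it is useful to work out the Gaussian approximation to Theorem 2.20 on page 117. Indeed, using the upper bound $A(\lambda) \le \frac{\lambda}{4N}$, where $A(\lambda)$ is defined by equation (2.2) on page 108, we get with $\mathbb P$ probability at least $1 - \varepsilon$

$$R - r_1 - 2\eta_J \le \inf_{\lambda \in \mathbb R_+} \frac{\lambda}{4N}\bigl[R + r_1 - 2r_1 R\bigr] + \frac{2d''_1}{\lambda} = \sqrt{\frac{2d''_1\,(R + r_1 - 2r_1 R)}{N}},$$

which can be solved in $R$ to obtain

Corollary 2.24. With $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$,

$$R(\theta) \le r_1(\theta) + \frac{d''_1(\theta)}{N}\bigl[1 - 2r_1(\theta)\bigr] + 2\eta_J + \sqrt{\frac{4d''_1(\theta)[1 - r_1(\theta)]\,r_1(\theta)}{N} + \frac{d''_1(\theta)^2[1 - 2r_1(\theta)]^2}{N^2} + \frac{4d''_1(\theta)[1 - 2r_1(\theta)]\,\eta_J}{N}}.$$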

This is to be compared with Vapnik's result, as proved in [37, page 138]:

Theorem 2.25 (Vapnik). For any i.i.d. probability distribution $\mathbb P$, with $\mathbb P$ probability at least $1 - \varepsilon$, for any $\theta \in \Theta$, putting

$$d_V = \log[\mathbb P(\mathfrak N_1)] + \log(4/\varepsilon),$$

$$R(\theta) \le r_1(\theta) + \frac{2d_V}{N} + \sqrt{\frac{4d_V\, r_1(\theta)}{N} + \frac{4d_V^2}{N^2}}.$$

Recalling that we can choose $(\eta_j)_{j=1}^2$ such that $\eta_J = \eta_2 = \frac{1}{10N}$ (which is negligible by all means) and such that for any $N \le 10^9$,

$$d''_1(\theta) \le \mathbb P\bigl[\log(\mathfrak N_1) \mid (Z_i)_{i=1}^N\bigr] - \log(\varepsilon) + 4.7,$$

we see that our complexity term is somehow more satisfactory than Vapnik's, since it is integrated outside the logarithm, with a slightly larger additional constant (remember that $\log(4) \simeq 1.4$, which is better than our 4.7, which could presumably be improved by working out a better sequence $\eta_j$, but not down to $\log(4)$). Our variance term is better, since we get $r_1(1 - r_1)$ as we should, instead of only $r_1$. We also have $\frac{d''_1}{N}$ instead of $2\frac{d_V}{N}$, coming from the fact that we do not use any symmetrization trick.

Let us illustrate these bounds on a numerical example (corresponding to a situation where the sample is noisy or the classification model is weak). Let us assume that $N = 1000$, $\inf_\Theta r_1 = r_1(\theta) = 0.2$, that we are performing binary classification with a model with VC dimension not greater than $h = 10$, and that we work at level of confidence $\varepsilon = 0.01$. Vapnik's theorem provides an upper bound for $R(\theta)$ not smaller than 0.610, whereas Corollary 2.24 gives $R(\theta) \le 0.461$ (using the bound $d''_1 \le d'_1 + 3.7$ when $N = 1000$). Now if we go for Theorem 2.20 and do not make a Gaussian approximation, we get $R(\theta) \le 0.453$. It is interesting to remark that this bound is achieved for $\lambda = 1195 > N = 1000$. This explains why the Gaussian approximation in Vapnik's bound can be improved: for such a large value of $\lambda$, $\lambda r_1(\theta)$ does not behave like a Gaussian random variable.
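The three figures of this comparison can be recomputed as follows. The sketch rests on explicit assumptions: Vapnik's bound is taken in the form $R \le r_1 + 2d_V/N + \sqrt{4d_V r_1/N + 4d_V^2/N^2}$, the entropy terms are replaced by the VC bound $h\log(2eN/h)$, $\eta_J = 1/(10N)$ and $d''_1 = d'_1 + 3.7$. Variable names and the grid over $\lambda$ are illustrative.

```python
import math

n, h, eps, r1 = 1000, 10, 0.01, 0.2
log_trace = h * math.log(2 * math.e * n / h)  # VC bound on the log trace size for 2N points
eta_last = 1 / (10 * n)

# Vapnik's bound (assumed form, see lead-in).
d_v = log_trace + math.log(4 / eps)
vapnik = r1 + 2 * d_v / n + math.sqrt(4 * d_v * r1 / n + 4 * d_v ** 2 / n ** 2)

# Gaussian approximation (form of Corollary 2.24).
d2 = log_trace - math.log(eps) + 3.7
gaussian = (r1 + d2 / n * (1 - 2 * r1) + 2 * eta_last
            + math.sqrt(4 * d2 * (1 - r1) * r1 / n
                        + (d2 * (1 - 2 * r1) / n) ** 2
                        + 4 * d2 * (1 - 2 * r1) * eta_last / n))

def theorem_2_20(lam):
    """Non-Gaussian bound (form of Theorem 2.20) for one value of lambda."""
    a = (2 * n / lam) * math.log(math.cosh(lam / (2 * n)))
    numerator = (1 + a) * r1 + 2 * d2 / lam + 2 * eta_last * (1 - r1)
    return numerator / (1 - a * (1 - 2 * r1))

exact, best_lam = min((theorem_2_20(lam), lam) for lam in range(1, 5001))
print(f"Vapnik: {vapnik:.3f}, Gaussian: {gaussian:.3f}, exact: {exact:.3f} (lambda = {best_lam})")
```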

Let us recall in conclusion that the best bound is provided by Theorem 2.17, giving $R(\theta) \le 0.4211$ (that is approximately 2/3 of Vapnik's bound), for optimal values of $k = 15$ and of $\lambda = 1010$. This bound can be seen to take advantage of the fact that Bernoulli random variables are not Gaussian (its Gaussian approximation, Corollary 2.23, gives a bound $R(\theta) \le 0.4325$, still with an optimal $k = 15$), and of the fact that the optimal size of the shadow sample is significantly larger than the size of the observed sample. Moreover, Theorem 2.17 does not assume that the sample is i.i.d., but only that it is independent, thus generalizing Vapnik's bounds to inhomogeneous data (this will presumably be the case when data are collected from different places where the experimental conditions may not be expected to be the same, although they may reasonably be assumed to be independent). We would also like to emphasize that our little numerical example shows that Vapnik's bounds can be expected to be appropriate when dealing with moderate sample sizes. More sophisticated bounds obviously have a better asymptotic behaviour, as shown in the first section. Nevertheless the numerical illustration of Theorem 1.18 suggests that Vapnik's bounds do not perform so badly for small to medium ratios between the sample size and the dimension of the classification model (with local bounds, we could only get down to 0.332, although using a considerably stronger dimension assumption).

We chose on purpose an example where it is non trivial to decide whether 
the chosen classifier does better than the 0.5 error rate of blind random 
classification. We think that this situation of weak learning is of practical interest, since "significant" weak learning rules may afterwards be aggregated or combined in various ways to achieve better classification rates.

3.1. How to build them.

3.1.1. The canonical hyperplane. Support Vector Machines, of widely spread use and renown, were introduced by V. Vapnik [37]. Before introducing them, we will study as a prerequisite the separation of points by hyperplanes in a finite dimensional Euclidean space. Support Vector Machines perform the same kind of linear separation after an implicit change of pattern space. The preceding PAC-Bayesian results provide a fit framework to analyze their generalization properties.

We will deal in this section with the classification of points in $\mathbb R^d$ in two classes. Let $Z = (x_i, y_i)_{i=1}^N \in (\mathbb R^d \times \{-1, +1\})^N$ be some set of labelled examples (called the training set hereafter). Let us split the set of indices $I = \{1, \dots, N\}$ according to the labels into two subsets

$$I_+ = \{i \in I : y_i = +1\}, \qquad I_- = \{i \in I : y_i = -1\}.$$

Let us then consider the set of admissible separating directions

$$A_Z = \bigl\{w \in \mathbb R^d : \sup_{b \in \mathbb R} \inf_{i \in I} (\langle w, x_i\rangle - b)\, y_i \ge 1\bigr\},$$

which can also be written as

$$A_Z = \bigl\{w \in \mathbb R^d : \max_{i \in I_-} \langle w, x_i\rangle + 2 \le \min_{i \in I_+} \langle w, x_i\rangle\bigr\}.$$

As is easily seen, the optimal value of $b$ for a fixed value of $w$, in other words the value of $b$ which maximizes $\inf_{i \in I} (\langle w, x_i\rangle - b)\, y_i$, is equal to

$$b_w = \frac12\Bigl[\max_{i \in I_-} \langle w, x_i\rangle + \min_{i \in I_+} \langle w, x_i\rangle\Bigr].$$
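The two characterizations of $A_Z$ and the optimal threshold $b_w$ translate directly into code. The following sketch uses a hypothetical four-point training set in $\mathbb R^2$ and illustrative function names; it is only meant to make the definitions concrete.

```python
def inner(u, v):
    """Euclidean inner product of two vectors given as sequences."""
    return sum(a * b for a, b in zip(u, v))

def projections(w, xs, ys):
    """Return (max over negatives, min over positives) of <w, x_i>."""
    max_neg = max(inner(w, x) for x, y in zip(xs, ys) if y == -1)
    min_pos = min(inner(w, x) for x, y in zip(xs, ys) if y == +1)
    return max_neg, min_pos

def optimal_threshold(w, xs, ys):
    """b_w: midpoint between the two extreme projections."""
    max_neg, min_pos = projections(w, xs, ys)
    return 0.5 * (max_neg + min_pos)

def is_admissible(w, xs, ys):
    """Membership in A_Z, using the second characterization."""
    max_neg, min_pos = projections(w, xs, ys)
    return max_neg + 2 <= min_pos

# Hypothetical toy training set.
xs = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
ys = [+1, +1, -1, -1]

w = (1.0, 0.0)
b = optimal_threshold(w, xs, ys)
worst_margin = min((inner(w, x) - b) * y for x, y in zip(xs, ys))
print(b, worst_margin, is_admissible(w, xs, ys), is_admissible((0.5, 0.0), xs, ys))
```

With $w = (1, 0)$ the smallest value of $(\langle w, x_i\rangle - b_w)y_i$ equals 1, so both characterizations agree that $w \in A_Z$, whereas the shorter direction $(0.5, 0)$ is not admissible.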



Lemma 3.1. When $A_Z \ne \emptyset$, $\inf\{\|w\|^2 : w \in A_Z\}$ is reached for only one value $w_Z$ of $w$.

Proof. Let $w_0 \in A_Z$. The set $A_Z \cap \{w \in \mathbb R^d : \|w\| \le \|w_0\|\}$ is a compact convex set and $w \mapsto \|w\|^2$ is strictly convex and therefore has a unique minimum on this set, which is also obviously its minimum on $A_Z$. □

Definition 3.1. When $A_Z \ne \emptyset$, the training set $Z$ is said to be linearly separable. The hyperplane

$$H = \{x \in \mathbb R^d : \langle w_Z, x\rangle - b_Z = 0\},$$

where

$$w_Z = \arg\min\{\|w\| : w \in A_Z\}, \qquad b_Z = b_{w_Z},$$

is called the canonical separating hyperplane of the training set $Z$. The quantity $\|w_Z\|^{-1}$ is called the margin of the canonical hyperplane.

Note that as $\min_{i \in I_+} \langle w_Z, x_i\rangle - \max_{i \in I_-} \langle w_Z, x_i\rangle = 2$, the margin is also equal to half the distance between the projections on the direction $w_Z$ of the positive and negative patterns.

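3.1.2. Computation of the canonical hyperplane. Let us consider the convex hulls $\mathcal X_+$ and $\mathcal X_-$ of the positive and negative patterns:

$$\mathcal X_+ = \Bigl\{\sum_{i \in I_+} \lambda_i x_i : (\lambda_i)_{i \in I_+} \in \mathbb R_+^{I_+},\ \sum_{i \in I_+} \lambda_i = 1\Bigr\},$$

$$\mathcal X_- = \Bigl\{\sum_{i \in I_-} \lambda_i x_i : (\lambda_i)_{i \in I_-} \in \mathbb R_+^{I_-},\ \sum_{i \in I_-} \lambda_i = 1\Bigr\}.$$

Let us introduce the closed convex set

$$\mathcal V = \mathcal X_+ - \mathcal X_- = \{x_+ - x_- : x_+ \in \mathcal X_+,\ x_- \in \mathcal X_-\}.$$

As $v \mapsto \|v\|^2$ is strictly convex, with compact lower level sets, there is a unique vector $v^*$ such that

$$\|v^*\|^2 = \inf\{\|v\|^2 : v \in \mathcal V\}.$$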

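Lemma 3.2. The set $A_Z$ is non empty (i.e. the training set $Z$ is linearly separable) if and only if $v^* \ne 0$. In this case

$$w_Z = \frac{2}{\|v^*\|^2}\, v^*,$$

and the margin of the canonical hyperplane is equal to $\frac12\|v^*\|$.

Proof. Let us assume first that $v^* = 0$, or equivalently that $\mathcal X_+ \cap \mathcal X_- \ne \emptyset$. As for any vector $w \in \mathbb R^d$,

$$\min_{i \in I_+} \langle w, x_i\rangle = \min_{x \in \mathcal X_+} \langle w, x\rangle, \qquad \max_{i \in I_-} \langle w, x_i\rangle = \max_{x \in \mathcal X_-} \langle w, x\rangle,$$

we see that necessarily $\min_{i \in I_+} \langle w, x_i\rangle - \max_{i \in I_-} \langle w, x_i\rangle \le 0$, which shows that $w$ cannot be in $A_Z$ and therefore that $A_Z$ is empty.

Let us assume now that $v^* \ne 0$, or equivalently that $\mathcal X_+ \cap \mathcal X_- = \emptyset$. Let us put $w^* = \frac{2}{\|v^*\|^2} v^*$. Let us remark first that

$$\begin{aligned}
\min_{i \in I_+} \langle w^*, x_i\rangle - \max_{i \in I_-} \langle w^*, x_i\rangle &= \inf_{x \in \mathcal X_+} \langle w^*, x\rangle - \sup_{x \in \mathcal X_-} \langle w^*, x\rangle \\
&= \inf_{x_+ \in \mathcal X_+,\, x_- \in \mathcal X_-} \langle w^*, x_+ - x_-\rangle \\
&= \frac{2}{\|v^*\|^2} \inf_{v \in \mathcal V} \langle v^*, v\rangle.
\end{aligned}$$

Let us now prove that $\inf_{v \in \mathcal V} \langle v^*, v\rangle = \|v^*\|^2$. Some arbitrary $v \in \mathcal V$ being fixed, consider the function

$$\beta \mapsto \|\beta v + (1 - \beta)v^*\|^2 : [0, 1] \to \mathbb R.$$

By definition of $v^*$, it reaches its minimum value for $\beta = 0$, and therefore has a non negative derivative at this point. Computing this derivative, we find that $\langle v - v^*, v^*\rangle \ge 0$, as claimed. We have proved that

$$\min_{i \in I_+} \langle w^*, x_i\rangle - \max_{i \in I_-} \langle w^*, x_i\rangle = 2,$$

and therefore that $w^* \in A_Z$. On the other hand, any $w \in A_Z$ is such that

$$2 \le \min_{i \in I_+} \langle w, x_i\rangle - \max_{i \in I_-} \langle w, x_i\rangle = \inf_{v \in \mathcal V} \langle w, v\rangle \le \|w\| \inf_{v \in \mathcal V} \|v\| = \|w\|\,\|v^*\|.$$

This proves that $\|w^*\| = \inf\{\|w\| : w \in A_Z\}$, and therefore that $w^* = w_Z$ as claimed. □

One way to compute $w_Z$ would therefore be to compute $v^*$ by minimizing

$$\Bigl\{\Bigl\|\sum_{i \in I} \lambda_i y_i x_i\Bigr\|^2 : (\lambda_i)_{i \in I} \in \mathbb R_+^I,\ \sum_{i \in I} \lambda_i = 2,\ \sum_{i \in I} y_i\lambda_i = 0\Bigr\}.$$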
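Although this is a tractable quadratic programming problem, a direct computation of $w_Z$ through the following proposition is usually preferred.

Proposition 3.3. The canonical direction $w_Z$ can be expressed as

$$w_Z = \sum_{i=1}^{N} \alpha_i^* y_i x_i,$$

where $(\alpha_i^*)_{i=1}^N$ is obtained by minimizing

$$\inf\{F(\alpha) : \alpha \in \mathcal A\},$$

where

$$\mathcal A = \Bigl\{(\alpha_i)_{i \in I} \in \mathbb R_+^I,\ \sum_{i \in I} \alpha_i y_i = 0\Bigr\},$$

and

$$F(\alpha) = \Bigl\|\sum_{i \in I} \alpha_i y_i x_i\Bigr\|^2 - 2\sum_{i \in I} \alpha_i.$$

Proof. Let $w(\alpha) = \sum_{i \in I} \alpha_i y_i x_i$ and let $S(\alpha) = \frac12 \sum_{i \in I} \alpha_i$. We can express the function $F(\alpha)$ as $F(\alpha) = \|w(\alpha)\|^2 - 4S(\alpha)$. Moreover it is important to notice that for any $s \in \mathbb R_+$, $\{w(\alpha) : \alpha \in \mathcal A,\ S(\alpha) = s\} = s\mathcal V$. This shows that for any $s \in \mathbb R_+$, $\inf\{F(\alpha) : \alpha \in \mathcal A,\ S(\alpha) = s\}$ is reached and that for any $\alpha_s \in \{\alpha \in \mathcal A : S(\alpha) = s\}$ reaching this infimum, $w(\alpha_s) = s v^*$. As $s \mapsto s^2\|v^*\|^2 - 4s : \mathbb R_+ \to \mathbb R$ reaches its infimum for only one value $s^*$ of $s$, namely at $s^* = \frac{2}{\|v^*\|^2}$, this shows that $F(\alpha)$ reaches its infimum on $\mathcal A$, and that for any $\alpha^* \in \mathcal A$ such that $F(\alpha^*) = \inf\{F(\alpha) : \alpha \in \mathcal A\}$, $w(\alpha^*) = \frac{2}{\|v^*\|^2} v^* = w_Z$. □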

3.1.3. Support vectors. 

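Definition 3.2. The set of support vectors $\mathcal S$ is defined by

$$\mathcal S = \{x_i : \langle w_Z, x_i\rangle - b_Z = y_i\}.$$

Proposition 3.4. Any $\alpha^*$ minimizing $F(\alpha)$ on $\mathcal A$ is such that

$$\{x_i : \alpha_i^* > 0\} \subset \mathcal S.$$
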
This implies that the representation $w_Z = w(\alpha^*)$ involves in general only a limited number of non zero coefficients and that $w_Z = w_{Z'}$, where $Z' = \{(x_i, y_i) : x_i \in \mathcal S\}$.

Proof. Let us consider any given $i \in I_+$ and $j \in I_-$, such that $\alpha_i^* > 0$ and $\alpha_j^* > 0$ (there exists at least one such index in each set $I_-$ and $I_+$, since the sums of the components of $\alpha^*$ on each of these sets are equal and since $\sum_{k \in I} \alpha_k^* > 0$). For any $t \in \mathbb R$, consider

$$\alpha_k(t) = \alpha_k^* + t\,\mathbb 1(k \in \{i, j\}), \qquad k \in I.$$

The vector $\alpha(t)$ is in $\mathcal A$ for any value of $t$ in some neighborhood of 0, therefore $\frac{\partial}{\partial t}\big|_{t=0} F[\alpha(t)] = 0$. Computing this derivative, we find that

$$y_i\langle w(\alpha^*), x_i\rangle + y_j\langle w(\alpha^*), x_j\rangle = 2.$$

As $y_i = -y_j$, this can also be written as

$$y_i\bigl[\langle w(\alpha^*), x_i\rangle - b_Z\bigr] + y_j\bigl[\langle w(\alpha^*), x_j\rangle - b_Z\bigr] = 2.$$

As $w(\alpha^*) \in A_Z$,

$$y_k\bigl[\langle w(\alpha^*), x_k\rangle - b_Z\bigr] \ge 1, \qquad k \in I,$$

which implies necessarily as claimed that

$$y_i\bigl[\langle w(\alpha^*), x_i\rangle - b_Z\bigr] = y_j\bigl[\langle w(\alpha^*), x_j\rangle - b_Z\bigr] = 1.$$

□



3.1.4. The non separable case. In the case when the training set $Z = (x_i, y_i)_{i=1}^N$ is not linearly separable, we can define a noisy canonical hyperplane as follows. We can choose $w \in \mathbb R^d$ and $b \in \mathbb R$ to minimize

$$C(w, b) = \sum_{i=1}^{N} \bigl[1 - (\langle w, x_i\rangle - b)\, y_i\bigr]_+ + \frac12\|w\|^2, \qquad (3.1)$$

where for any real number $r$, $r_+ = \max\{r, 0\}$ is the positive part of $r$.



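Theorem 3.5. Let us introduce the dual criterion

$$F(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac12\Bigl\|\sum_{i=1}^{N} y_i\alpha_i x_i\Bigr\|^2$$

and the domain $\mathcal A' = \bigl\{\alpha \in \mathbb R_+^N : \alpha_i \le 1,\ i = 1, \dots, N,\ \sum_{i=1}^N y_i\alpha_i = 0\bigr\}$. Let $\alpha^* \in \mathcal A'$ be such that $F(\alpha^*) = \sup_{\alpha \in \mathcal A'} F(\alpha)$. Let $w^* = \sum_{i=1}^N y_i\alpha_i^* x_i$. There is a threshold $b^*$ (whose construction will be detailed in the proof), such that

$$C(w^*, b^*) = \inf_{w \in \mathbb R^d,\, b \in \mathbb R} C(w, b).$$

Corollary 3.6. (scaled criterion) For any positive real parameter $\lambda$ let us consider the criterion

$$C_\lambda(w, b) = \lambda^2 \sum_{i=1}^{N} \bigl[1 - (\langle w, x_i\rangle - b)\, y_i\bigr]_+ + \frac12\|w\|^2$$

and the domain $\mathcal A'_\lambda = \bigl\{\alpha \in \mathbb R_+^N : \alpha_i \le \lambda^2,\ i = 1, \dots, N,\ \sum_{i=1}^N y_i\alpha_i = 0\bigr\}$. For any solution $\alpha^*$ of the maximization problem $F(\alpha^*) = \sup_{\alpha \in \mathcal A'_\lambda} F(\alpha)$, the vector $w^* = \sum_{i=1}^N y_i\alpha_i^* x_i$ is such that

$$\inf_{b \in \mathbb R} C_\lambda(w^*, b) = \inf_{w \in \mathbb R^d,\, b \in \mathbb R} C_\lambda(w, b).$$

Let us remark that in the separable case, the scaled criterion is minimized by the canonical hyperplane for $\lambda$ large enough. This extension of the canonical hyperplane computation in dual space is often called the box constraint, for obvious reasons.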
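The relation between the primal criterion (3.1) and the dual criterion of Theorem 3.5 can be illustrated on a tiny example. The sketch below uses a hypothetical two-point training set for which a feasible $\alpha$ closing the duality gap is easy to guess by hand; it does not solve the quadratic program in general, and the function names are illustrative.

```python
def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_criterion(w, b, xs, ys):
    """C(w, b): sum of hinge terms plus half the squared norm of w."""
    hinge = sum(max(0.0, 1.0 - (inner(w, x) - b) * y) for x, y in zip(xs, ys))
    return hinge + 0.5 * inner(w, w)

def dual_criterion(alpha, xs, ys):
    """F(alpha) together with the direction w = sum_i y_i alpha_i x_i."""
    dim = len(xs[0])
    w = [sum(a * y * x[j] for a, x, y in zip(alpha, xs, ys)) for j in range(dim)]
    return sum(alpha) - 0.5 * inner(w, w), w

# Hypothetical training set: one positive and one negative pattern.
xs = [(2.0, 0.0), (0.0, 0.0)]
ys = [+1, -1]

alpha = [0.5, 0.5]                       # feasible: 0 <= alpha_i <= 1, sum_i y_i alpha_i = 0
dual_value, w_star = dual_criterion(alpha, xs, ys)
primal_value = primal_criterion(w_star, 1.0, xs, ys)   # threshold b = 1 chosen by hand
weaker_dual, _ = dual_criterion([0.2, 0.2], xs, ys)    # another feasible point
print(dual_value, primal_value, weaker_dual)
```

Here the dual and primal values coincide, which certifies that $(w^*, b^*) = ((1, 0), 1)$ minimizes $C$, while any other feasible $\alpha$ only gives a lower bound.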

Proof. The corollary is a straightforward consequence of the scale property $C_\lambda(w, b, x) = \lambda^2 C(\lambda^{-1}w, b, \lambda x)$, where we have made the dependence of the criterion on $x \in \mathbb R^{dN}$ explicit. Let us come now to the proof of the theorem.

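The minimization of $C(w, b)$ can be performed in dual space by extending the couple of parameters $(w, b)$ to $\bar w = (w, b, \gamma) \in \mathbb R^d \times \mathbb R \times \mathbb R_+^N$ and introducing the dual multipliers $\alpha \in \mathbb R_+^N$ and the criterion

$$G(\alpha, \bar w) = \sum_{i=1}^{N} \gamma_i + \sum_{i=1}^{N} \alpha_i\bigl\{\bigl[1 - (\langle w, x_i\rangle - b)\, y_i\bigr] - \gamma_i\bigr\} + \frac12\|w\|^2.$$

We see that

$$C(w, b) = \inf_{\gamma \in \mathbb R_+^N} \sup_{\alpha \in \mathbb R_+^N} G[\alpha, (w, b, \gamma)],$$

and therefore, putting $\mathcal W = \{(w, b, \gamma) : w \in \mathbb R^d,\ b \in \mathbb R,\ \gamma \in \mathbb R_+^N\}$, we are led to solve the minimization problem

$$G(\alpha^*, \bar w^*) = \inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w),$$

whose solution $\bar w^* = (w^*, b^*, \gamma^*)$ is such that $C(w^*, b^*) = \inf_{(w, b) \in \mathbb R^{d+1}} C(w, b)$, according to the preceding identity. As for any value of $\alpha' \in \mathbb R_+^N$,

$$\inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w) \ge \inf_{\bar w \in \mathcal W} G(\alpha', \bar w),$$

it is immediately seen that

$$\inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w) \ge \sup_{\alpha \in \mathbb R_+^N} \inf_{\bar w \in \mathcal W} G(\alpha, \bar w).$$

We are going to show that there is no duality gap, meaning that this inequality is indeed an equality. More importantly, we will do so by exhibiting a saddle point, which, by solving the dual problem, will also solve the original one.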

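Let us first make explicit the solution of the dual problem (the interest of this dual problem precisely lies in the fact that it can more easily be solved explicitly). Introducing the admissible set of values of $\alpha$,

$$\mathcal A' = \Bigl\{\alpha \in \mathbb R^N : 0 \le \alpha_i \le 1,\ i = 1, \dots, N,\ \sum_{i=1}^{N} y_i\alpha_i = 0\Bigr\},$$

it is elementary to check that

$$\inf_{\bar w \in \mathcal W} G(\alpha, \bar w) = \begin{cases} \inf_{w \in \mathbb R^d} G[\alpha, (w, 0, 0)], & \alpha \in \mathcal A', \\ -\infty, & \text{otherwise.} \end{cases}$$

As

$$G[\alpha, (w, 0, 0)] = \frac12\|w\|^2 + \sum_{i=1}^{N} \alpha_i\bigl(1 - \langle w, x_i\rangle y_i\bigr),$$

we see that $\inf_{w \in \mathbb R^d} G[\alpha, (w, 0, 0)]$ is reached at

$$w_\alpha = \sum_{i=1}^{N} y_i\alpha_i x_i.$$

This proves that, for any $\alpha \in \mathcal A'$,

$$\inf_{\bar w \in \mathcal W} G(\alpha, \bar w) = F(\alpha).$$

The continuous map $\alpha \mapsto \inf_{\bar w \in \mathcal W} G(\alpha, \bar w)$ reaches a (not necessarily unique) maximum $\alpha^*$ on the compact convex set $\mathcal A'$. We are now going to exhibit a choice of $\bar w^* \in \mathcal W$ such that $(\alpha^*, \bar w^*)$ is a saddle point. This means that we are going to show that

$$G(\alpha^*, \bar w^*) = \inf_{\bar w \in \mathcal W} G(\alpha^*, \bar w) = \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w^*).$$

It will imply that

$$\inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w) \le \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w^*) = G(\alpha^*, \bar w^*)$$

on the one hand and that

$$\inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w) \ge \inf_{\bar w \in \mathcal W} G(\alpha^*, \bar w) = G(\alpha^*, \bar w^*)$$

on the other hand, proving that

$$G(\alpha^*, \bar w^*) = \inf_{\bar w \in \mathcal W} \sup_{\alpha \in \mathbb R_+^N} G(\alpha, \bar w)$$

as required.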

Construction of w*. 

• Let us put w* = Wot* ■ 

• If there is j G {1, . . . , N} such that < < 1, let us put 

b* = {xj,w*)-yj. 

Otherwise, let us put 

b* = sup{{xi, w*)-l: a* > 0, = +1, i = 1, . . . , TV}. 

• Let us then put 



7i 



0, a* < 1, 

1 - i{w*,Xi) - b*)yi, a* = 1. 



If we can prove that 

I - {{w*,Xi) -b*)yi{ 



' < 0, a* = 0, 
= 0, < a* < 1, 
> 0, a* = 1, 



(3.2) 



it will show that $\gamma^* \in \mathbb{R}_+^N$ and therefore that $\mathbf{w}^* = (w^*, b^*, \gamma^*) \in W$. It will
also show that



N 



G{a,w*) = ^l*+ Yl ot^[l-{{w*,x^)-b*)yi\+U^*f, 
proving that $G(\alpha^*, \mathbf{w}^*) = \sup_{\alpha \in \mathbb{R}_+^N} G(\alpha, \mathbf{w}^*)$. As obviously $G(\alpha^*, \mathbf{w}^*) = G[\alpha^*, (w^*, 0, 0)]$, we already know that $G(\alpha^*, \mathbf{w}^*) = \inf_{\mathbf{w} \in W} G(\alpha^*, \mathbf{w})$. This will show that $(\alpha^*, \mathbf{w}^*)$ is the saddle point we were looking for, thus ending the proof of the theorem.

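Proof of equation (3.2): Let us deal first with the case when there is $j \in \{1, \dots, N\}$ such that $0 < \alpha_j^* < 1$.

For any $i \in \{1, \dots, N\}$ such that $0 < \alpha_i^* < 1$, there is $\epsilon > 0$ such that for any $t \in (-\epsilon, \epsilon)$, $\alpha^* + t y_i e_i - t y_j e_j \in A'$, where $(e_k)_{k=1}^N$ is the canonical base of $\mathbb{R}^N$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t y_i e_i - t y_j e_j) = 0$. Computing this derivative, we obtain

$$\frac{\partial}{\partial t}\Big|_{t=0} F(\alpha^* + t y_i e_i - t y_j e_j) = y_i - \langle w^*, x_i \rangle + \langle w^*, x_j \rangle - y_j = y_i \bigl[1 - (\langle w^*, x_i \rangle - b^*) y_i\bigr].$$

Thus $1 - (\langle w^*, x_i \rangle - b^*) y_i = 0$, as required. This shows also that the definition of $b^*$ does not depend on the choice of $j$ such that $0 < \alpha_j^* < 1$.

For any $i \in \{1, \dots, N\}$ such that $\alpha_i^* = 0$, there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$, $\alpha^* + t e_i - t y_i y_j e_j \in A'$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t e_i - t y_i y_j e_j) \leq 0$, showing that $1 - (\langle w^*, x_i \rangle - b^*) y_i \leq 0$ as required.

For any $i \in \{1, \dots, N\}$ such that $\alpha_i^* = 1$, there is $\epsilon > 0$ such that for any $t \in (0, \epsilon)$, $\alpha^* - t e_i + t y_i y_j e_j \in A'$. Thus $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* - t e_i + t y_i y_j e_j) \leq 0$, showing that $1 - (\langle w^*, x_i \rangle - b^*) y_i \geq 0$ as required. This completes the proof that $(\alpha^*, \mathbf{w}^*)$ is a saddle point in this case.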

Let us deal now with the case where $\alpha^* \in \{0, 1\}^N$. If we are not in the trivial case where the vector $(y_i)_{i=1}^N$ is constant, the case $\alpha^* = 0$ is ruled out. Indeed, in this case, considering $\alpha^* + t e_i + t e_j$, where $y_i y_j = -1$, we would get the contradiction $2 = \frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* + t e_i + t e_j) \leq 0$.

Thus there are values of $j$ such that $\alpha_j^* = 1$, and since $\sum_{i=1}^N \alpha_i^* y_i = 0$, both classes are present in the set $\{j : \alpha_j^* = 1\}$.

Now for any $i, j \in \{1, \dots, N\}$ such that $\alpha_i^* = \alpha_j^* = 1$ and such that $y_i = +1$ and $y_j = -1$, $\frac{\partial}{\partial t}\big|_{t=0} F(\alpha^* - t e_i - t e_j) = -2 + \langle w^*, x_i \rangle - \langle w^*, x_j \rangle \leq 0$. Thus

$$\sup\{\langle w^*, x_i \rangle - 1 : \alpha_i^* = 1, \ y_i = +1\} \leq \inf\{\langle w^*, x_j \rangle + 1 : \alpha_j^* = 1, \ y_j = -1\},$$

showing that

$$1 - (\langle w^*, x_k \rangle - b^*) y_k \geq 0, \qquad \alpha_k^* = 1.$$

Finally, for any $i$ such that $\alpha_i^* = 0$, for any $j$ such that $\alpha_j^* = 1$ and $y_j = y_i$,

$$\frac{\partial}{\partial t}\Big|_{t=0} F(\alpha^* + t e_i - t e_j) = -y_i \langle w^*, x_i - x_j \rangle \leq 0,$$

showing that $1 - (\langle w^*, x_i \rangle - b^*) y_i \leq 0$. This completes the proof that $(\alpha^*, \mathbf{w}^*)$ is in all circumstances a saddle point.
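The construction of $w^*$ and $b^*$ and the sign conditions (3.2) can be checked by hand on the smallest possible case. The following is a minimal sketch, not the author's procedure: it assumes scalar patterns, exactly two training points with labels $+1$ and $-1$, and the dual criterion $F(\alpha) = \sum_i \alpha_i - \frac{1}{2}\|\sum_i y_i \alpha_i x_i\|^2$ (the form consistent with the derivatives computed in the proof). The function names and the grid search are illustrative choices.

```python
def dual_objective(alpha, xs, ys):
    """Assumed dual criterion F(alpha) = sum_i alpha_i - 0.5 * (sum_i y_i alpha_i x_i)^2."""
    w = sum(y * a * x for a, x, y in zip(alpha, xs, ys))
    return sum(alpha) - 0.5 * w * w


def solve_two_points(x_pos, x_neg, steps=1000):
    """Two scalar patterns with labels +1 and -1.

    The constraint sum_i y_i alpha_i = 0 forces alpha = (a, a), so F is
    maximised over a on a regular grid of the box [0, 1].
    """
    xs, ys = (x_pos, x_neg), (1, -1)
    a = max((k / steps for k in range(steps + 1)),
            key=lambda v: dual_objective((v, v), xs, ys))
    w = a * (x_pos - x_neg)          # w* = sum_i y_i alpha_i x_i
    # Both definitions of b* reduce to <x_pos, w*> - 1 here, because the
    # positive pattern has alpha > 0 in the two examples below.
    b = x_pos * w - 1
    # Left-hand side of (3.2) for each pattern.
    slacks = [1 - (w * x - b) * y for x, y in zip(xs, ys)]
    return a, w, b, slacks


# Well separated points: 0 < alpha* < 1 and both quantities in (3.2) vanish.
interior = solve_two_points(2.0, -2.0)
# Close points: the box constraint is active, alpha* = 1, quantities are >= 0.
boundary = solve_two_points(0.5, -0.5)
```

In the first case the maximiser lies strictly inside the box and both patterns sit exactly on the margin; in the second the constraint $\alpha_i \leq 1$ is active and the corresponding quantities in (3.2) are non-negative, as the proof requires.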

3.1.5. Support Vector Machines. 

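Definition 3.3. The symmetric measurable kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be positive (or more precisely positive semi-definite) if for any $n \in \mathbb{N}$, any $(x_i)_{i=1}^n \in \mathcal{X}^n$,

$$\inf_{\alpha \in \mathbb{R}^n} \sum_{i=1}^n \sum_{j=1}^n \alpha_i K(x_i, x_j) \alpha_j \geq 0.$$

Let $Z = (x_i, y_i)_{i=1}^N$ be some training set. Let us consider as previously

$$A = \Bigl\{\alpha \in \mathbb{R}_+^N : \sum_{i=1}^N \alpha_i y_i = 0\Bigr\}.$$

Let

$$F(\alpha) = \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i y_i K(x_i, x_j) y_j \alpha_j - \sum_{i=1}^N \alpha_i.$$

Definition 3.4. Let $K$ be a positive symmetric kernel. The training set $Z$ is said to be $K$-separable if

$$\inf\{F(\alpha) : \alpha \in A\} > -\infty.$$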

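Lemma 3.7. When $Z$ is $K$-separable, $\inf\{F(\alpha) : \alpha \in A\}$ is reached.

Proof. Consider the training set $Z' = (x_i', y_i)_{i=1}^N$, where

$$x_i' = \Bigl\{\bigl[\{K(x_k, x_\ell)\}_{k=1,\ell=1}^{N \ \ N}\bigr]^{1/2}(i, j)\Bigr\}_{j=1}^N \in \mathbb{R}^N.$$

We see that $F(\alpha) = \frac{1}{2}\bigl\|\sum_{i=1}^N \alpha_i y_i x_i'\bigr\|^2 - \sum_{i=1}^N \alpha_i$. We have proved in the previous section that $Z'$ is linearly separable if and only if $\inf\{F(\alpha) : \alpha \in A\} > -\infty$, and that the infimum is reached in this case. □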
Proposition 3.8. Let $K$ be a symmetric positive kernel and let $Z = (x_i, y_i)_{i=1}^N$ be some $K$-separable training set. Let $\alpha^* \in A$ be such that $F(\alpha^*) = \inf\{F(\alpha) : \alpha \in A\}$. Let

$$I_-^* = \{i \in \mathbb{N} : 1 \leq i \leq N, \ y_i = -1, \ \alpha_i^* > 0\},$$
$$I_+^* = \{i \in \mathbb{N} : 1 \leq i \leq N, \ y_i = +1, \ \alpha_i^* > 0\},$$
$$b^* = \frac{1}{2}\Bigl\{\sum_{j=1}^N \alpha_j^* y_j K(x_j, x_{i_-}) + \sum_{j=1}^N \alpha_j^* y_j K(x_j, x_{i_+})\Bigr\}, \qquad i_- \in I_-^*, \ i_+ \in I_+^*,$$

where the value of $b^*$ does not depend on the choice of $i_-$ and $i_+$. The classification rule $f : \mathcal{X} \to \mathcal{Y}$ defined by the formula

$$f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^N \alpha_i^* y_i K(x_i, x) - b^*\Bigr)$$

is independent of the choice of $\alpha^*$ and is called the support vector machine defined by $K$ and $Z$. The set $S = \{x_j : \sum_{i=1}^N \alpha_i^* y_i K(x_i, x_j) - b^* = y_j\}$ is called the set of support vectors. For any choice of $\alpha^*$, $\{x_i : \alpha_i^* > 0\} \subset S$.

An important consequence of this proposition is that the support vector machine defined by $K$ and $Z$ is also the support vector machine defined by $K$ and $Z' = \{(x_i, y_i) : \alpha_i^* > 0, \ 1 \leq i \leq N\}$, since this restriction of the index set contains the value $\alpha^*$ where the minimum of $F$ is reached.

Proof. The independence from the choice of $\alpha^*$, which is not necessarily unique, is seen as follows. Let $(x_i)_{i=1}^N$ and $x \in \mathcal{X}$ be fixed. Let us put for ease of notations $x_{N+1} = x$. Let $M$ be the $(N+1) \times (N+1)$ symmetric semi-definite matrix defined by $M(i, j) = K(x_i, x_j)$, $i = 1, \dots, N+1$, $j = 1, \dots, N+1$. Let us consider the mapping $\Psi : \{x_i : i = 1, \dots, N+1\} \to \mathbb{R}^{N+1}$ defined by

$$\Psi(x_i) = \bigl[M^{1/2}(i, j)\bigr]_{j=1}^{N+1} \in \mathbb{R}^{N+1}. \qquad (3.3)$$

Let us consider the training set $Z' = [\Psi(x_i), y_i]_{i=1}^N$. Then $Z'$ is linearly separable,

$$F(\alpha) = \frac{1}{2}\Bigl\|\sum_{i=1}^N \alpha_i y_i \Psi(x_i)\Bigr\|^2 - \sum_{i=1}^N \alpha_i,$$

and we have proved that for any choice of $\alpha^* \in A$ minimizing $F(\alpha)$, $w_{Z'} = \sum_{i=1}^N \alpha_i^* y_i \Psi(x_i)$. Thus the support vector machine defined by $K$ and $Z$ can also be expressed by the formula

$$f(x) = \operatorname{sign}\bigl[\langle w_{Z'}, \Psi(x) \rangle - b_{Z'}\bigr]$$

which does not depend on $\alpha^*$. The definition of $S$ is such that $\Psi(S)$ is the set of support vectors defined in the linear case, where its stated property has already been proved. □

We can in the same way use the box constraint and show that any solution
$\alpha^* \in \arg\min\{F(\alpha) : \alpha \in A, \ \alpha_i \leq \lambda^2, \ i = 1, \dots, N\}$ minimizes

N r / N 

]yi 



^ N N 

). (3.4) 



i=i j=i 

3.1.6. Building kernels. The results of this section (except the last one)
are drawn from ^H]. We have no reference for the last proposition of
this section, although we believe it is well known. We include them for the
convenience of the reader.

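Proposition 3.9. Let $K_1$ and $K_2$ be positive symmetric kernels on $\mathcal{X}$. Then for any $a \in \mathbb{R}_+$

$$(aK_1 + K_2)(x, x') \overset{\text{def}}{=} aK_1(x, x') + K_2(x, x')$$
$$\text{and} \quad (K_1 \cdot K_2)(x, x') \overset{\text{def}}{=} K_1(x, x') K_2(x, x')$$

are also positive symmetric kernels. Moreover, for any measurable function $g : \mathcal{X} \to \mathbb{R}$, $K_g(x, x') \overset{\text{def}}{=} g(x) g(x')$ is also a positive symmetric kernel.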

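Proof. It is enough to prove the proposition in the case when $\mathcal{X}$ is finite and kernels are just ordinary symmetric matrices. Thus we can assume without loss of generality that $\mathcal{X} = \{1, \dots, n\}$. Then for any $\alpha \in \mathbb{R}^n$, using usual matrix notations,

$$\langle \alpha, (aK_1 + K_2)\alpha \rangle = a \langle \alpha, K_1 \alpha \rangle + \langle \alpha, K_2 \alpha \rangle \geq 0,$$

$$\langle \alpha, (K_1 \cdot K_2)\alpha \rangle = \sum_{i,j} \alpha_i K_1(i, j) K_2(i, j) \alpha_j = \sum_{i,j,k} \alpha_i K_1^{1/2}(i, k) K_1^{1/2}(k, j) K_2(i, j) \alpha_j$$
$$= \sum_k \underbrace{\sum_{i,j} \bigl[K_1^{1/2}(k, i) \alpha_i\bigr] K_2(i, j) \bigl[K_1^{1/2}(k, j) \alpha_j\bigr]}_{\geq 0} \geq 0,$$

$$\langle \alpha, K_g \alpha \rangle = \sum_{i,j} \alpha_i g(i) g(j) \alpha_j = \Bigl(\sum_i \alpha_i g(i)\Bigr)^2 \geq 0.$$

□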



Proposition 3.10. Let $K$ be some positive symmetric kernel on $\mathcal{X}$. Let $p : \mathbb{R} \to \mathbb{R}$ be a polynomial with positive coefficients. Let $g : \mathcal{X} \to \mathbb{R}^d$ be a measurable function. Then

$$p(K)(x, x') \overset{\text{def}}{=} p[K(x, x')],$$
$$\exp(K)(x, x') \overset{\text{def}}{=} \exp[K(x, x')]$$
$$\text{and} \quad G_g(x, x') \overset{\text{def}}{=} \exp\bigl(-\|g(x) - g(x')\|^2\bigr)$$

are all positive symmetric kernels.

Proof. The first assertion is a direct consequence of the previous proposition. The second one comes from the fact that the exponential function is the pointwise limit of a sequence of polynomial functions with positive coefficients. The third one is seen from the second one and the decomposition

$$G_g(x, x') = \Bigl[\exp\bigl(-\|g(x)\|^2\bigr) \exp\bigl(-\|g(x')\|^2\bigr)\Bigr] \exp\bigl[2 \langle g(x), g(x') \rangle\bigr].$$

□
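The decomposition used for the Gaussian kernel is easy to check numerically. The sketch below is only an illustration under stated assumptions: $g$ is taken to be the identity on $\mathbb{R}^3$, the points are drawn at random, and positivity is probed through random quadratic forms rather than proved. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(u, v):
    """G(u, v) = exp(-||u - v||^2) for two points given as sequences of floats."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)))


def decomposed_kernel(u, v):
    """Same kernel written as exp(-||u||^2) * exp(-||v||^2) * exp(2 <u, v>)."""
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    dot = sum(a * b for a, b in zip(u, v))
    return math.exp(-norm_u) * math.exp(-norm_v) * math.exp(2.0 * dot)


def quadratic_form(points, alpha, kernel):
    """sum_i sum_j alpha_i K(x_i, x_j) alpha_j."""
    return sum(ai * kernel(xi, xj) * aj
               for ai, xi in zip(alpha, points)
               for aj, xj in zip(alpha, points))


rng = random.Random(0)
points = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]

# Largest discrepancy between the kernel and its factorised form.
max_gap = max(abs(gaussian_kernel(u, v) - decomposed_kernel(u, v))
              for u in points for v in points)

# Smallest value of the quadratic form over random coefficient vectors.
min_form = min(
    quadratic_form(points, [rng.gauss(0, 1) for _ in points], gaussian_kernel)
    for _ in range(200))
```

The first factor of the decomposition is a kernel of the form $K_g$ from Proposition 3.9 and the second is the exponential of a scalar product, which is why their product is again positive.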



Proposition 3.11. With the notations of the previous proposition, any training set $Z = (x_i, y_i)_{i=1}^N \in (\mathcal{X} \times \{-1, +1\})^N$ is $G_g$-separable as soon as $g(x_i)$, $i = 1, \dots, N$ are distinct points of $\mathbb{R}^d$.

Proof. It is clearly enough to prove the case when $\mathcal{X} = \mathbb{R}^d$ and $g$ is the identity. Let us consider some other generic point $x_{N+1} \in \mathbb{R}^d$ and define $\Psi$ as in (3.3). It is enough to prove that $\Psi(x_1), \dots, \Psi(x_N)$ are affine independent, since the simplex, and therefore any affine independent set of points, can be shattered by affine half-spaces. Let us assume that $(\Psi(x_1), \dots, \Psi(x_N))$ are affine dependent; this means that for some $(\lambda_1, \dots, \lambda_N) \neq 0$ such that $\sum_{i=1}^N \lambda_i = 0$,

$$\sum_{i=1}^N \sum_{j=1}^N \lambda_i G(x_i, x_j) \lambda_j = 0.$$

Thus, $(\lambda_i)_{i=1}^{N+1}$, where we have put $\lambda_{N+1} = 0$, is in the kernel of the symmetric positive semi-definite matrix $G(x_i, x_j)_{i,j \in \{1, \dots, N+1\}}$. Therefore

$$\sum_{i=1}^N \lambda_i G(x_i, x_{N+1}) = 0,$$

for any $x_{N+1} \in \mathbb{R}^d$. This would mean that the functions $x \mapsto \exp(-\|x - x_i\|^2)$ are linearly dependent, which can be easily proved to be false. Indeed, let $n \in \mathbb{R}^d$ be such that $\|n\| = 1$ and $\langle n, x_i \rangle$, $i = 1, \dots, N$ are distinct (such a vector exists, because it has to be outside the union of a finite number of hyperplanes, which is of zero Lebesgue measure on the sphere). Let us assume for a while that for some $(\lambda_i)_{i=1}^N \in \mathbb{R}^N$, for any $x \in \mathbb{R}^d$,

$$\sum_{i=1}^N \lambda_i \exp(-\|x - x_i\|^2) = 0.$$

Considering $x = tn$, for $t \in \mathbb{R}$, we would get

$$\sum_{i=1}^N \lambda_i \exp\bigl(2t \langle n, x_i \rangle - \|x_i\|^2\bigr) = 0, \qquad t \in \mathbb{R}.$$

Letting $t$ go to infinity, we see that this is only possible if $\lambda_i = 0$ for all values of $i$. □

3.2. Bounds for Support Vector Machines. 

3.2.1. Compression scheme bounds. We can use Support Vector Machines in the framework of compression schemes and apply Theorem 2.17 on page 114. More precisely, given some positive symmetric kernel $K$ on $\mathcal{X}$, we may consider for any training set $Z' = (x_i', y_i')_i$ the classifier $f_{Z'} : \mathcal{X} \to \mathcal{Y}$ which is equal to the Support Vector Machine defined by $K$ and $Z'$ whenever $Z'$ is $K$-separable, and which is equal to some constant classification rule otherwise (we take this convention to stick to the framework described earlier; we will only use $f_{Z'}$ in the $K$-separable case, so this extension of the definition is just a matter of presentation). In the application of Theorem 2.17 in the case when the observed sample $(X_i, Y_i)_{i=1}^N$ is $K$-separable, a natural (if not always optimal) choice of $Z'$ is to choose for $(x_i')$ the set of support vectors defined by $Z = (X_i, Y_i)_{i=1}^N$ and to choose for $(y_i')$ the corresponding values of $Y$. This is justified by the fact that $f_Z = f_{Z'}$, as shown in Proposition 3.8. In the case when $Z$ is not $K$-separable, we can train a Support Vector Machine with the box constraint, then remove all the errors to obtain a $K$-separable subsample $Z' = \{(X_i, Y_i) : \alpha_i^* < \lambda^2, \ 1 \leq i \leq N\}$ (using the same notations as in equation (3.4) on page 132) and then consider its support vectors as the compression set. Still using the notations of page 132, this means we have to compute successively $\alpha^* \in \arg\min\{F(\alpha) : \alpha \in A, \ \alpha_i \leq \lambda^2\}$, and $\alpha^{**} \in \arg\min\{F(\alpha) : \alpha \in A, \ \alpha_i = 0 \text{ when } \alpha_i^* = \lambda^2\}$, to keep eventually the compression set indexed by $J = \{i : 1 \leq i \leq N, \ \alpha_i^{**} > 0\}$, and the corresponding Support Vector Machine $f_J$. Different values of $\lambda$ can be used at this stage, producing different candidate compression sets: when $\lambda$ increases, the number of errors should decrease; on the other hand, when $\lambda$ decreases, the margin of the separable subset $Z'$ increases, supporting the hope for a smaller set of support vectors. Thus we can use $\lambda$ to monitor the number of errors on the training set we accept from the compression scheme. As we can use whatever heuristic we want while selecting the compression set, we can also try to threshold in the previous construction $\alpha^{**}$ at different levels $\eta \geq 0$, to produce candidate compression sets $J_\eta = \{i : 1 \leq i \leq N, \ \alpha_i^{**} > \eta\}$ of various sizes.

As the size $|J|$ of the compression set is random in this construction, we have to use a version of Theorem 2.17 (page 114) which handles compression sets of arbitrary sizes. This is done by choosing for each $k$ a $k$-partially exchangeable posterior distribution $\pi_k$ which weights the compression sets of all dimensions. We immediately see that we can choose $\pi_k$ such that

$$-\log\bigl[\pi_k(\Delta_k(J))\bigr] \leq \log\bigl[|J|(|J| + 1)\bigr] + |J| \log\Bigl(\frac{(k+1)eN}{|J|}\Bigr).$$

If we observe the shadow sample patterns, and if computer resources permit, we can of course use more elaborate bounds than Theorem 2.17, such as the transductive counterpart of Theorem 1.24 (where we may consider the submodels made of all the compression sets of the same size). Theorems based on relative bounds, such as Theorem 1.59, can also be used. Gibbs distributions can be approximated by Monte Carlo techniques, where a Markov chain with the proper invariant measure consists in suitable local perturbations of the compression set.

Let us mention also that the use of compression schemes based on Support Vector Machines can be tailored to perform some kind of feature aggregation. Imagine that the kernel $K$ is defined as the scalar product in $L_2(\pi)$, where $\pi \in \mathcal{M}_+^1(\Theta)$. More precisely let us consider for some set of soft classification rules $\{f_\theta : \mathcal{X} \to \mathbb{R} \, ; \ \theta \in \Theta\}$ the kernel

$$K(x, x') = \int_{\theta \in \Theta} f_\theta(x) f_\theta(x') \, \pi(d\theta).$$

In this setting, the Support Vector Machine applied to the training set $Z =$
$(x_i, y_i)_{i=1}^N$ has the form

$$f_Z(x) = \operatorname{sign}\Bigl(\int_{\theta \in \Theta} f_\theta(x) \sum_{i=1}^N y_i \alpha_i f_\theta(x_i) \, \pi(d\theta) - b\Bigr)$$

and, may it be too burdening to compute, we can replace it with some finite approximation

$$\tilde{f}_Z(x) = \operatorname{sign}\Bigl(\frac{1}{m} \sum_{k=1}^m f_{\theta_k}(x) w_k - b\Bigr),$$

where the set $\{\theta_k, \ k = 1, \dots, m\}$ and the weights $\{w_k, \ k = 1, \dots, m\}$ are computed in some suitable way from $Z' = (x_i, y_i)_{i, \alpha_i > 0}$, the set of support vectors of $f_Z$. For instance, we can draw $\{\theta_k, \ k = 1, \dots, m\}$ at random according to the probability distribution proportional to

$$\Bigl|\sum_{i=1}^N y_i \alpha_i f_\theta(x_i)\Bigr| \, \pi(d\theta),$$

define the weights $w_k$ by

$$w_k = \operatorname{sign}\Bigl(\sum_{i=1}^N y_i \alpha_i f_{\theta_k}(x_i)\Bigr) \int_{\theta \in \Theta} \Bigl|\sum_{i=1}^N y_i \alpha_i f_\theta(x_i)\Bigr| \, \pi(d\theta),$$

and choose the smallest value of $m$ for which this approximation still classifies $Z'$ without errors. Let us remark that we have built $\tilde{f}_Z$ in such a way that

$$\lim_{m \to +\infty} \tilde{f}_Z(x_i) = f_Z(x_i) = y_i, \quad \text{a.s.}$$

for any support index $i$ such that $\alpha_i > 0$.

Alternatively, given $Z'$, we can select a finite set of features $\Theta' \subset \Theta$ such that $Z'$ is $K_{\Theta'}$-separable, where $K_{\Theta'}(x, x') = \sum_{\theta \in \Theta'} f_\theta(x) f_\theta(x')$, and consider the Support Vector Machine $f_{Z'}$ built with the kernel $K_{\Theta'}$. As soon as $\Theta'$ is chosen as a function of $Z'$ only, Theorem 2.17 (page 114) applies and provides some level of confidence for the risk of $f_{Z'}$.

3.2.2. The Vapnik-Cervonenkis dimension of a family of subsets. Let us consider some set $X$ and some set $S \subset \{0, 1\}^X$ of subsets of $X$. Let $h(S)$ be the VC dimension of $S$, defined as

$$h(S) = \max\{|A| : A \text{ finite and } A \cap S = \{0, 1\}^A\},$$

where by definition $A \cap S = \{A \cap B : B \in S\}$. Let us notice that this definition does not depend on the choice of the reference set $X$. Indeed $X$ can be chosen to be $\bigcup S$, the union of all the sets in $S$, or any bigger set. Let us notice also that for any set $B$, $h(B \cap S) \leq h(S)$, the reason being that

$$A \cap (B \cap S) = B \cap (A \cap S).$$

This notion of VC dimension is useful because it can, as we will see about Support Vector Machines, be computed in some important special cases. Let us prove here as an illustration that $h(S) = d + 1$ when $X = \mathbb{R}^d$ and $S$ is made of all the half spaces:

$$S = \{A_{w,b} : w \in \mathbb{R}^d, \ b \in \mathbb{R}\}, \quad \text{where } A_{w,b} = \{x \in X : \langle w, x \rangle \geq b\}.$$

Proposition 3.12. With the previous notations, $h(S) = d + 1$.

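Proof. Let $(e_i)_{i=1}^{d+1}$ be the canonical base of $\mathbb{R}^{d+1}$, and let $X$ be the affine subspace it generates, which can be identified with $\mathbb{R}^d$. For any $(\epsilon_i)_{i=1}^{d+1} \in \{-1, +1\}^{d+1}$, let $w = \sum_{i=1}^{d+1} \epsilon_i e_i$ and $b = 0$. The half space $A_{w,b} \cap X$ is such that $\{e_i \, ; \ i = 1, \dots, d+1\} \cap (A_{w,b} \cap X) = \{e_i \, ; \ \epsilon_i = +1\}$. This proves that $h(S) \geq d + 1$.

To prove that $h(S) \leq d + 1$, we have to show that for any set $A \subset \mathbb{R}^d$ of size $|A| = d + 2$, there is $B \subset A$ such that $B \notin (A \cap S)$. This will obviously be the case if the convex hulls of $B$ and $A \setminus B$ have a non empty intersection: indeed if a hyperplane separates two sets of points, it also separates their convex hulls. As $|A| > d + 1$, $A$ is affine dependent: there is $(\lambda_x)_{x \in A} \in \mathbb{R}^{d+2} \setminus \{0\}$ such that $\sum_{x \in A} \lambda_x x = 0$ and $\sum_{x \in A} \lambda_x = 0$. The set $B = \{x \in A : \lambda_x > 0\}$ is non-empty, as well as its complement $A \setminus B$, because $\sum_{x \in A} \lambda_x = 0$ and $\lambda \neq 0$. Moreover $\sum_{x \in B} \lambda_x = \sum_{x \in A \setminus B} -\lambda_x > 0$. The relation

$$\frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in B} \lambda_x x = \frac{1}{\sum_{x \in B} \lambda_x} \sum_{x \in A \setminus B} -\lambda_x x$$

shows that the convex hulls of $B$ and $A \setminus B$ have a non void intersection. □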
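The lower bound $h(S) \geq d + 1$ in the proof above is constructive, so it can be replayed by enumeration. This is a small sketch of that first half of the argument only, with hypothetical helper names: it lists the subsets of the canonical base cut out by the half spaces $A_{w,0}$ with $w = \sum_i \epsilon_i e_i$ and counts them.

```python
from itertools import product


def halfspace_trace(points, w, b):
    """Indices of the points x such that <w, x> >= b."""
    return frozenset(
        i for i, x in enumerate(points)
        if sum(wi * xi for wi, xi in zip(w, x)) >= b)


def canonical_basis(dim):
    """The vectors e_1, ..., e_dim of R^dim."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def traces_on_basis(d):
    """Subsets of {e_1, ..., e_{d+1}} obtained with w = sum_i eps_i e_i and b = 0."""
    basis = canonical_basis(d + 1)
    return {halfspace_trace(basis, list(eps), 0.0)
            for eps in product((-1.0, 1.0), repeat=d + 1)}


d = 3
traces = traces_on_basis(d)
```

Each sign pattern $\epsilon$ selects exactly the vectors $e_i$ with $\epsilon_i = +1$, so all $2^{d+1}$ subsets appear and the $d + 1$ points are shattered.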
Let us introduce the function of two integers

$$\Phi_n^h = \sum_{k=0}^h \binom{n}{k}.$$

Let us notice that $\Phi$ can alternatively be defined by the relations:

$$\Phi_n^h = \begin{cases} 2^n & \text{when } n \leq h, \\ \Phi_{n-1}^{h-1} + \Phi_{n-1}^h & \text{when } n > h. \end{cases}$$

Theorem 3.13. Whenever $\bigcup S$ is finite,

$$|S| \leq \Phi\Bigl(\Bigl|\bigcup S\Bigr|, h(S)\Bigr).$$

Theorem 3.14. For any $h \leq \frac{n}{2}$,

$$\Phi_n^h \leq \exp\bigl(n H(\tfrac{h}{n})\bigr) \leq \exp\bigl[h\bigl(\log(\tfrac{n}{h}) + 1\bigr)\bigr],$$

where $H(p) = -p \log(p) - (1 - p) \log(1 - p)$ is the Shannon entropy of the Bernoulli distribution with parameter $p$.
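The two theorems can be sanity-checked on small cases. The sketch below is an illustration, not part of the original argument: it computes $\Phi_n^h$ directly, finds the VC dimension of a small random family by brute force, and tests the recurrence, the inequality of Theorem 3.13 and the two bounds of Theorem 3.14 (the latter for $h \leq n/2$). The helper names, the random family and the numerical tolerances are assumptions made for the example.

```python
import math
import random
from itertools import combinations


def phi(n, h):
    """Phi_n^h = sum_{k=0}^{h} C(n, k)."""
    return sum(math.comb(n, k) for k in range(h + 1))


def entropy(p):
    """Shannon entropy (natural logarithm) of the Bernoulli(p) distribution."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def vc_dimension(family):
    """Largest |A| such that the traces {A & B : B in family} are all subsets of A."""
    ground = sorted(set().union(*family))
    best = 0
    for size in range(1, len(ground) + 1):
        shattered = any(
            len({frozenset(A) & B for B in family}) == 2 ** size
            for A in combinations(ground, size))
        if not shattered:
            break  # subsets of shattered sets are shattered, so we can stop
        best = size
    return best


rng = random.Random(1)
ground = range(6)
family = {frozenset(x for x in ground if rng.random() < 0.5) for _ in range(20)}
n_points = len(set().union(*family))

# Theorem 3.13 on this particular family.
sauer_ok = len(family) <= phi(n_points, vc_dimension(family))

# Recurrence Phi_n^h = Phi_{n-1}^h + Phi_{n-1}^{h-1} for n > h >= 1.
recurrence_ok = all(
    phi(n, h) == phi(n - 1, h) + phi(n - 1, h - 1)
    for n in range(2, 12) for h in range(1, n))

# Theorem 3.14 for h <= n / 2, with a small slack for rounding errors.
bound_ok = all(
    phi(n, h)
    <= math.exp(n * entropy(h / n)) * (1 + 1e-12)
    <= math.exp(h * (math.log(n / h) + 1)) * (1 + 1e-9)
    for n in range(2, 40) for h in range(1, n // 2 + 1))
```

The brute-force search is exponential in the size of the ground set, so it is only meant for toy families like this one.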

Proof of Theorem 3.13. Let us prove this theorem by induction on $|\bigcup S|$. It is easy to check that it holds true when $|\bigcup S| = 1$. Let $X = \bigcup S$, let $x \in X$ and $X' = X \setminus \{x\}$. Define ($\triangle$ denoting the symmetric difference of two sets)

$$S' = \{A \in S : A \triangle \{x\} \in S\},$$
$$S'' = \{A \in S : A \triangle \{x\} \notin S\}.$$

Clearly, $\sqcup$ denoting the disjoint union, $S = S' \sqcup S''$ and $S \cap X' = (S' \cap X') \sqcup (S'' \cap X')$. Moreover $|S'| = 2|S' \cap X'|$ and $|S''| = |S'' \cap X'|$. Thus $|S| = |S'| + |S''| = 2|S' \cap X'| + |S'' \cap X'| = |S \cap X'| + |S' \cap X'|$. Obviously $h(S \cap X') \leq h(S)$. Moreover $h(S' \cap X') = h(S') - 1$, because if $A \subset X'$ is shattered by $S'$ (or equivalently by $S' \cap X'$), then $A \cup \{x\}$ is shattered by $S'$ (we say that $A$ is shattered by $S$ when $S \cap A = \{0, 1\}^A$). Using the induction hypothesis, we then see that $|S| \leq \Phi_{|X'|}^{h(S)} + \Phi_{|X'|}^{h(S)-1}$. But as $|X'| = |X| - 1$, the right-hand side of this inequality is equal to $\Phi_{|X|}^{h(S)}$, according to the recurrence equation satisfied by $\Phi$. □

Proof of Theorem 3.14. This is the well known Chernoff bound for the deviation of sums of Bernoulli r.v.: let $(\sigma_1, \dots, \sigma_n)$ be i.i.d. Bernoulli r.v. with parameter $1/2$. Let us notice that

$$\Phi_n^h = 2^n \, \mathbb{P}\Bigl(\sum_{i=1}^n \sigma_i \leq h\Bigr).$$

For any positive real number $\lambda$,

$$\mathbb{P}\Bigl(\sum_{i=1}^n \sigma_i \leq h\Bigr) \leq \exp(\lambda h) \, \mathbb{E}\bigl[\exp(-\lambda \sigma_1)\bigr]^n.$$
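Differentiating the right-hand side in $\lambda$ shows that its minimal value is $\exp[-n \mathcal{K}(\frac{h}{n}, \frac{1}{2})]$, where $\mathcal{K}(p, q) = p \log(\frac{p}{q}) + (1 - p) \log(\frac{1-p}{1-q})$ is the Kullback divergence function between two Bernoulli distributions $B_p$ and $B_q$ of parameters $p$ and $q$. Indeed the optimal value $\lambda^*$ of $\lambda$ is such that

$$h = n \, \frac{\mathbb{E}\bigl[\sigma_1 \exp(-\lambda^* \sigma_1)\bigr]}{\mathbb{E}\bigl[\exp(-\lambda^* \sigma_1)\bigr]}.$$

Therefore (using the fact that two Bernoulli distributions with the same expectations are equal)

$$\log\bigl\{\mathbb{E}\bigl[\exp(-\lambda^* \sigma_1)\bigr]\bigr\} = -\lambda^* B_{h/n}(\sigma_1) - \mathcal{K}(B_{h/n}, B_{1/2}) = -\lambda^* \tfrac{h}{n} - \mathcal{K}(\tfrac{h}{n}, \tfrac{1}{2}).$$

The announced result then follows from the identity

$$H(p) = \log(2) - \mathcal{K}(p, \tfrac{1}{2}) = p \log(p^{-1}) + (1 - p) \log\Bigl(1 + \frac{p}{1 - p}\Bigr) \leq p\bigl[\log(p^{-1}) + 1\bigr].$$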
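The minimization over $\lambda$ in the proof above can be spelled out explicitly. The following worked derivation is a sketch under the assumption, suggested by the surrounding text, that the quantity being minimized is $\exp\{\lambda h + n \log \mathbb{E}[\exp(-\lambda \sigma_1)]\}$ with $\sigma_1$ a Bernoulli variable of parameter $1/2$; the shorthand $p = h/n \leq 1/2$ is introduced here for readability.

```latex
\begin{align*}
\mathbb{E}\bigl[\exp(-\lambda\sigma_1)\bigr] &= \tfrac12\bigl(1 + e^{-\lambda}\bigr),\\
\frac{\partial}{\partial\lambda}\Bigl[\lambda p + \log\tfrac12\bigl(1+e^{-\lambda}\bigr)\Bigr] = 0
 &\iff \frac{e^{-\lambda^*}}{1+e^{-\lambda^*}} = p
 \iff \lambda^* = \log\frac{1-p}{p} \geq 0,\\
\lambda^* p + \log\tfrac12\bigl(1+e^{-\lambda^*}\bigr)
 &= p\log\frac{1-p}{p} - \log\bigl[2(1-p)\bigr]\\
 &= -p\log p - (1-p)\log(1-p) - \log 2
 = H(p) - \log(2).
\end{align*}
```

Multiplying by $n$ and exponentiating gives the minimal value $\exp\{-n[\log(2) - H(p)]\}$, which after multiplication by $2^n$ yields the bound $\exp(nH(\frac{h}{n}))$ of the theorem.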

3.2.3. VC dimension of linear rules with margin. The proof of the following
theorem has been suggested to us by a similar proof presented in

Theorem 3.15. Consider a family of points $(x_1, \dots, x_n)$ in some Euclidean vector space $E$ and a family of affine functions

$$\mathcal{H} = \{g_{w,b} : E \to \mathbb{R} \, ; \ w \in E, \ \|w\| = 1, \ b \in \mathbb{R}\},$$

where

$$g_{w,b}(x) = \langle w, x \rangle - b, \qquad x \in E.$$

Assume that there is a set of thresholds $(b_i)_{i=1}^n \in \mathbb{R}^n$ such that for any $(y_i)_{i=1}^n \in \{-1, +1\}^n$, there is $g_{w,b} \in \mathcal{H}$ such that

$$\inf_{i=1,\dots,n} \bigl(g_{w,b}(x_i) - b_i\bigr) y_i \geq \gamma.$$

Let us also introduce the empirical variance of $(x_i)_{i=1}^n$,

$$\operatorname{Var}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^n \Bigl\|x_i - \frac{1}{n} \sum_{j=1}^n x_j\Bigr\|^2.$$

In this case and with these notations,

$$\frac{\operatorname{Var}(x_1, \dots, x_n)}{\gamma^2} \geq \begin{cases} n - 1 & \text{when } n \text{ is even}, \\ (n - 1)\dfrac{n^2 - 1}{n^2} & \text{when } n \text{ is odd}. \end{cases} \qquad (3.5)$$

Moreover, equality is reached when $\gamma$ is optimal, $b_i = 0$, $i = 1, \dots, n$ and $(x_1, \dots, x_n)$ is a regular simplex (i.e. when $2\gamma$ is the minimum distance between the convex hulls of any two subsets of $\{x_1, \dots, x_n\}$ and $\|x_i - x_j\|$ does not depend on $i \neq j$).
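Proof. Let $(s_i)_{i=1}^n \in \mathbb{R}^n$ be such that $\sum_{i=1}^n s_i = 0$. Let $\sigma$ be a uniformly distributed random variable with values in $\mathfrak{S}_n$, the set of permutations of the first $n$ integers $\{1, \dots, n\}$. By assumption, for any value of $\sigma$, there is an affine function $g_{w,b} \in \mathcal{H}$ such that

$$\min_{i=1,\dots,n} \bigl[g_{w,b}(x_i) - b_i\bigr]\bigl[2\,\mathbb{1}(s_{\sigma(i)} > 0) - 1\bigr] \geq \gamma.$$

As a consequence

$$\Bigl\langle \sum_{i=1}^n s_{\sigma(i)} x_i, w \Bigr\rangle = \sum_{i=1}^n s_{\sigma(i)}\bigl(\langle x_i, w \rangle - b - b_i\bigr) + \sum_{i=1}^n s_{\sigma(i)} b_i \geq \sum_{i=1}^n \gamma |s_{\sigma(i)}| + s_{\sigma(i)} b_i.$$

Therefore, using the fact that the map $x \mapsto (\max\{0, x\})^2$ is convex,

$$\mathbb{E}\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) \geq \mathbb{E}\Bigl[\Bigl(\max\Bigl\{0, \sum_{i=1}^n \gamma |s_{\sigma(i)}| + s_{\sigma(i)} b_i\Bigr\}\Bigr)^2\Bigr]$$
$$\geq \Bigl(\max\Bigl\{0, \sum_{i=1}^n \gamma \, \mathbb{E}(|s_{\sigma(i)}|) + \mathbb{E}(s_{\sigma(i)}) b_i\Bigr\}\Bigr)^2 = \gamma^2 \Bigl(\sum_{i=1}^n |s_i|\Bigr)^2,$$

where $\mathbb{E}$ is the expectation with respect to the random permutation $\sigma$. On the other hand

$$\mathbb{E}\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) = \sum_{i=1}^n \mathbb{E}(s_{\sigma(i)}^2) \|x_i\|^2 + \sum_{i \neq j} \mathbb{E}(s_{\sigma(i)} s_{\sigma(j)}) \langle x_i, x_j \rangle.$$

Moreover

$$\mathbb{E}(s_{\sigma(i)}^2) = \frac{1}{n} \mathbb{E}\Bigl(\sum_{i=1}^n s_{\sigma(i)}^2\Bigr) = \frac{1}{n} \sum_{i=1}^n s_i^2.$$

In the same way, for any $i \neq j$,

$$\mathbb{E}(s_{\sigma(i)} s_{\sigma(j)}) = \frac{1}{n(n-1)} \mathbb{E}\Bigl(\sum_{i \neq j} s_{\sigma(i)} s_{\sigma(j)}\Bigr) = \frac{1}{n(n-1)} \Bigl[\Bigl(\sum_{i=1}^n s_i\Bigr)^2 - \sum_{i=1}^n s_i^2\Bigr] = -\frac{1}{n(n-1)} \sum_{i=1}^n s_i^2.$$

Thus

$$\mathbb{E}\Bigl(\Bigl\|\sum_{i=1}^n s_{\sigma(i)} x_i\Bigr\|^2\Bigr) = \Bigl(\sum_{i=1}^n s_i^2\Bigr) \Bigl[\frac{1}{n} \sum_{i=1}^n \|x_i\|^2 - \frac{1}{n(n-1)} \sum_{i \neq j} \langle x_i, x_j \rangle\Bigr]$$
$$= \Bigl(\sum_{i=1}^n s_i^2\Bigr) \Bigl[\Bigl(\frac{1}{n} + \frac{1}{n(n-1)}\Bigr) \sum_{i=1}^n \|x_i\|^2 - \frac{1}{n(n-1)} \Bigl\|\sum_{i=1}^n x_i\Bigr\|^2\Bigr] = \frac{n}{n-1} \Bigl(\sum_{i=1}^n s_i^2\Bigr) \operatorname{Var}(x_1, \dots, x_n).$$

We have proved that

$$\frac{\operatorname{Var}(x_1, \dots, x_n)}{\gamma^2} \geq \frac{(n-1)\bigl(\sum_{i=1}^n |s_i|\bigr)^2}{n \sum_{i=1}^n s_i^2}.$$

This can be used with $s_i = \mathbb{1}(i \leq \frac{n}{2}) - \mathbb{1}(i > \frac{n}{2})$ in the case when $n$ is even and $s_i = (n-1)\,\mathbb{1}(i \leq \frac{n+1}{2}) - (n+1)\,\mathbb{1}(i > \frac{n+1}{2})$ in the case when $n$ is odd to establish the first inequality (3.5) of the theorem.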

Checking that equality is reached for the simplex is an easy computation when the simplex $(x_i)_{i=1}^n \in (\mathbb{R}^n)^n$ is parametrized in such a way that

$$x_i(j) = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Indeed the distance between the convex hulls of any two subsets of the simplex is the distance between their mean values (i.e. centers of mass). □
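The equality case can be verified with exact arithmetic. The sketch below takes the simplex $x_i = e_i$ and relies on the fact stated in the proof, namely that the distance between the convex hulls of two subsets of the simplex equals the distance between their means. It computes $\operatorname{Var}/\gamma^2$ and compares it with the right-hand side of (3.5), read here as $n - 1$ for even $n$ and $(n-1)(n^2-1)/n^2$ for odd $n$. Function names are illustrative, and only splits into a non-empty subset and its non-empty complement are considered.

```python
from fractions import Fraction
from itertools import combinations


def simplex(n):
    """The points x_i = e_i of R^n, with exact rational coordinates."""
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


def mean(points):
    m = len(points)
    return [sum(col) / m for col in zip(*points)]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def empirical_variance(points):
    """(1/n) sum_i ||x_i - mean||^2."""
    center = mean(points)
    return sum(sq_dist(x, center) for x in points) / len(points)


def squared_margin(points):
    """gamma^2, where 2*gamma is the smallest distance between the means of a
    non-empty subset of the points and of its non-empty complement."""
    n = len(points)
    best = None
    for size in range(1, n):
        for subset in combinations(range(n), size):
            inside = [points[i] for i in subset]
            outside = [points[i] for i in range(n) if i not in subset]
            d2 = sq_dist(mean(inside), mean(outside))
            best = d2 if best is None else min(best, d2)
    return best / 4


def lower_bound(n):
    """Right-hand side of (3.5) as read above."""
    if n % 2 == 0:
        return Fraction(n - 1)
    return Fraction((n - 1) * (n * n - 1), n * n)


ratios = {n: empirical_variance(simplex(n)) / squared_margin(simplex(n))
          for n in range(2, 8)}
```

For instance $n = 3$ gives $\operatorname{Var} = 2/3$ and $\gamma^2 = 3/8$, hence a ratio of $16/9$, which is the value of the bound for odd $n = 3$.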
3.2.4. Application to Support Vector Machines. We are going to apply Theorem 3.15 to Support Vector Machines in the transductive case. So let us consider $(X_i, Y_i)_{i=1}^{(k+1)N}$ distributed according to some partially exchangeable distribution $\mathbb{P}$ and assume that $(X_i)_{i=1}^{(k+1)N}$ and $(Y_i)_{i=1}^N$ are observed. Let us consider some positive kernel $K$ on $\mathcal{X}$. For any $K$-separable training set of the form $Z' = (X_i, y_i')_{i=1}^{(k+1)N}$, where $(y_i')_{i=1}^{(k+1)N} \in \mathcal{Y}^{(k+1)N}$, let $f_{Z'}$ be the Support Vector Machine defined by $K$ and $Z'$ and let $\gamma(Z')$ be its margin. Let

{k+l)N {k+l)N 

(k+l)N 

(A; + l)iV ^ ^ 

(This is an easily computable upper-bound for the radius of some ball containing the image of $(X_1, \dots, X_{(k+1)N})$ in feature space.)
Let us define for any integer $h$ the margins

$$\gamma_{2h} = (2h - 1)^{-1/2} \quad \text{and} \quad \gamma_{2h+1} = \Bigl[2h\Bigl(1 - \frac{1}{(2h+1)^2}\Bigr)\Bigr]^{-1/2}. \qquad (3.6)$$

Let us consider for any $h = 1, \dots, N$ the exchangeable model

$$\mathcal{R}_h = \bigl\{f_{Z'} : Z' = (X_i, y_i')_{i=1}^{(k+1)N} \text{ is } K\text{-separable and } \gamma(Z') \geq R\gamma_h\bigr\}.$$

The family of models $\mathcal{R}_h$, $h = 1, \dots, N$ is nested, and we know from Theorem 3.15 and Theorems 3.13 and 3.14 that

$$\log(|\mathcal{R}_h|) \leq h \log\Bigl(\frac{(k+1)eN}{h}\Bigr).$$

We can then consider on the large model $\mathcal{R} = \bigsqcup_{h=1}^N \mathcal{R}_h$ (the disjoint union of the submodels) an exchangeable prior $\pi$ which is uniform on each $\mathcal{R}_h$ and is such that $\pi(\mathcal{R}_h) \geq \frac{1}{h(h+1)}$. Applying Theorem 2.8 (page 104) we get

Proposition 3.16. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any $h = 1, \dots, N$,
any Support Vector Machine $f \in \mathcal{R}_h$,



r2{f) < 

k + 1 1-«^P 
mf 



-^ri{f) - Alog/ ^MIM^ - l°g[fe(fe+i)]-l°g(^) 



N' ) N h ; N 



k AGIR+ 1 - exp(- 



riU) 
Searching the whole model $\mathcal{R}_h$ may be unfeasible, nonetheless any heuristic can be applied to choose $f$. For instance, a Support Vector Machine $f'$ can be trained from the training set $(X_i, Y_i)_{i=1}^N$ and then $(y_i')_{i=1}^{(k+1)N}$ can be set to $y_i' = \operatorname{sign}(f'(X_i))$, $i = 1, \dots, (k+1)N$.

3.2.5. Inductive margin bounds for Support Vector Machines. In order to
establish inductive margin bounds, we will need a different combinatorial 
lemma. It is due to We will reproduce their proof with some tiny im- 
provements on the values of constants. 

Let us consider the finite case when $\mathcal{X} = \{1, \dots, n\}$, $\mathcal{Y} = \{1, \dots, b\}$ and $b \geq 3$ (the question we will study would be meaningless in the case when $b \leq 2$). Assume as usual that we are dealing with a prescribed set of classification rules $\mathcal{R} = \{f : \mathcal{X} \to \mathcal{Y}\}$. Let us say that a pair $(A, s)$, where $A \subset \mathcal{X}$ is a non empty set of shapes and $s : A \to \{2, \dots, b - 1\}$ a threshold function, is shattered by the set of functions $F \subset \mathcal{R}$ if for any $(\sigma_x)_{x \in A} \in \{-1, +1\}^A$, there exists some $f \in F$ such that $\min_{x \in A} \sigma_x [f(x) - s(x)] \geq 1$.

Definition 3.5. Let the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ be the maximal size $|A|$ of the first component of the pairs which are shattered by $\mathcal{R}$.

Let us say that a subset of classification rules $F \subset \mathcal{Y}^{\mathcal{X}}$ is separated whenever for any pair $(f, g) \in F^2$ such that $f \neq g$, $\|f - g\|_\infty = \max_{x \in \mathcal{X}} |f(x) - g(x)| \geq 2$. Let $\mathfrak{M}(\mathcal{R})$ be the maximum size $|F|$ of separated subsets $F$ of $\mathcal{R}$. Note that if $F$ is a separated subset of $\mathcal{R}$ such that $|F| = \mathfrak{M}(\mathcal{R})$, then it is a 1-net for the $L_\infty$ distance: for any function $f \in \mathcal{R}$ there exists $g \in F$ such that $\|f - g\|_\infty \leq 1$ (otherwise $f$ could be added to $F$ to create a larger separated set).
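The remark that a separated subset which cannot be enlarged is a 1-net can be illustrated on a toy model. The sketch below is an assumption-laden example: it uses $n = 3$ shapes, $b = 4$ labels, a random model of 30 rules, and a greedy scan that produces a separated subset which is maximal for inclusion (not necessarily of maximum size $\mathfrak{M}(\mathcal{R})$; the argument in parentheses above applies equally to it).

```python
import random
from itertools import product


def sup_distance(f, g):
    """||f - g||_inf for two rules stored as tuples of labels."""
    return max(abs(a - b) for a, b in zip(f, g))


def greedy_separated_subset(rules):
    """Scan the rules once, keeping each rule that is at sup-distance >= 2
    from every rule kept so far."""
    kept = []
    for f in rules:
        if all(sup_distance(f, g) >= 2 for g in kept):
            kept.append(f)
    return kept


rng = random.Random(2)
n, b = 3, 4
all_rules = list(product(range(1, b + 1), repeat=n))
rules = rng.sample(all_rules, 30)          # a hypothetical model R
net = greedy_separated_subset(rules)

is_separated = all(sup_distance(f, g) >= 2
                   for i, f in enumerate(net) for g in net[i + 1:])
is_one_net = all(min(sup_distance(f, g) for g in net) <= 1 for f in rules)
```

Since the rules take integer values, a rule rejected by the scan is at distance at most 1 from some kept rule, which is exactly the 1-net property.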

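Lemma 3.17. With the above notations, whenever the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ is not greater than $h$,

$$\log \mathfrak{M}(\mathcal{R}) < \log\bigl[(b-1)(b-2)n\bigr] \Bigl\{\frac{\log\bigl[\sum_{i=1}^h \binom{n}{i}(b-2)^i\bigr]}{\log(2)} + 1\Bigr\} + \log(2)$$
$$\leq \log\bigl[(b-1)(b-2)n\bigr] \Bigl\{\Bigl[\log\Bigl(\frac{(b-2)n}{h}\Bigr) + 1\Bigr] \frac{h}{\log(2)} + 1\Bigr\} + \log(2).$$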

Proof. For any set of functions $F \subset \mathcal{Y}^{\mathcal{X}}$, let $t(F)$ be the number of pairs $(A, s)$ shattered by $F$. Let $t(m, n)$ be the minimum of $t(F)$ over all separated sets of functions $F \subset \mathcal{Y}^{\mathcal{X}}$ of size $|F| = m$ ($n$ is here to recall that the shape space $\mathcal{X}$ is made of $n$ shapes). For any $m$ such that $t(m, n) > \sum_{i=1}^h \binom{n}{i}(b-2)^i$, it is clear that any separated set of functions of size $|F| \geq m$ shatters at least one pair $(A, s)$ such that $|A| > h$. Indeed, $t(m, n)$ is clearly from its definition a non decreasing function of $m$, so that $t(|F|, n) > \sum_{i=1}^h \binom{n}{i}(b-2)^i$. Moreover there are only $\sum_{i=1}^h \binom{n}{i}(b-2)^i$ pairs $(A, s)$ such that $|A| \leq h$. As a consequence, whenever the fat shattering dimension of $(\mathcal{X}, \mathcal{R})$ is not greater than $h$ we have $\mathfrak{M}(\mathcal{R}) < m$.

It is clear that for any $n \geq 1$, $t(2, n) = 1$.

Lemma 3.18. For any $m \geq 1$, $t[mn(b-1)(b-2), n] \geq 2t[m, n-1]$, and therefore $t[2n(n-1) \cdots (n-r+1)(b-1)^r(b-2)^r, n] \geq 2^r$.

Proof. Let $F = \{f_1, \dots, f_{mn(b-1)(b-2)}\}$ be some separated set of functions of size $mn(b-1)(b-2)$. For any pair $(f_{2i-1}, f_{2i})$, $i = 1, \dots, mn(b-1)(b-2)/2$, there is $x_i \in \mathcal{X}$ such that $|f_{2i-1}(x_i) - f_{2i}(x_i)| \geq 2$. Since $|\mathcal{X}| = n$, there is $x \in \mathcal{X}$ such that $\sum_{i=1}^{mn(b-1)(b-2)/2} \mathbb{1}(x_i = x) \geq m(b-1)(b-2)/2$. Let $I = \{i : x_i = x\}$. Since there are $(b-1)(b-2)/2$ pairs $(y_1, y_2) \in \mathcal{Y}^2$ such that $1 \leq y_1 < y_2 - 1 \leq b - 1$, there is some pair $(y_1, y_2)$, such that $1 \leq y_1 < y_2 \leq b$ and such that $\sum_{i \in I} \mathbb{1}(\{y_1, y_2\} = \{f_{2i-1}(x), f_{2i}(x)\}) \geq m$. Let $J = \{i \in I : \{f_{2i-1}(x), f_{2i}(x)\} = \{y_1, y_2\}\}$. Let

$$F_1 = \{f_{2i-1} : i \in J, \ f_{2i-1}(x) = y_1\} \cup \{f_{2i} : i \in J, \ f_{2i}(x) = y_1\},$$
$$F_2 = \{f_{2i-1} : i \in J, \ f_{2i-1}(x) = y_2\} \cup \{f_{2i} : i \in J, \ f_{2i}(x) = y_2\}.$$

Obviously $|F_1| = |F_2| = |J| = m$. Moreover the restrictions of the functions of $F_1$ to $\mathcal{X} \setminus \{x\}$ are separated, and it is the same with $F_2$. Thus $F_1$ strongly shatters at least $t(m, n-1)$ pairs $(A, s)$ such that $A \subset \mathcal{X} \setminus \{x\}$ and it is the same with $F_2$. Finally, if the pair $(A, s)$ where $A \subset \mathcal{X} \setminus \{x\}$ is both shattered by $F_1$ and $F_2$, then $F_1 \cup F_2$ shatters also $(A \cup \{x\}, s')$ where $s'(x') = s(x')$ for any $x' \in A$ and $s'(x) = \lfloor \frac{y_1 + y_2}{2} \rfloor$. Thus $F_1 \cup F_2$, and therefore $F$, shatters at least $2t(m, n-1)$ pairs $(A, s)$. □

Resuming the proof of Lemma 3.17, let us choose for $r$ the smallest integer such that $2^r > \sum_{i=1}^h \binom{n}{i}(b-2)^i$, which is no greater than

$$\Bigl\lfloor \frac{\log\bigl[\sum_{i=1}^h \binom{n}{i}(b-2)^i\bigr]}{\log(2)} + 1 \Bigr\rfloor.$$

In the case when $1 \leq n \leq r$,

$$\log(\mathfrak{M}(\mathcal{R})) < |\mathcal{X}| \log(|\mathcal{Y}|) = n \log(b) \leq r \log(b) \leq r \log\bigl[(b-1)(b-2)n\bigr] + \log(2),$$

which proves the lemma. In the remaining case $n > r$,

$$t\bigl[2n^r(b-1)^r(b-2)^r, n\bigr] \geq t\bigl[2n(n-1) \cdots (n-r+1)(b-1)^r(b-2)^r, n\bigr] \geq 2^r > \sum_{i=1}^h \binom{n}{i}(b-2)^i.$$

Thus $\mathfrak{M}(\mathcal{R}) < 2\bigl[(b-2)(b-1)n\bigr]^r$ as claimed. □

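In order to apply this combinatorial lemma to Support Vector Machines, let us consider now the case of separating hyperplanes in $\mathbb{R}^d$ (the generalization to Support Vector Machines being straightforward). Assume that $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. For any sample $(X_i)_{i=1}^{(k+1)N}$, let

$$R\bigl(X_1^{(k+1)N}\bigr) = \max\{\|X_i\| : 1 \leq i \leq (k+1)N\}.$$

Let us consider the set of parameters

$$\Theta = \{(w, b) \in \mathbb{R}^d \times \mathbb{R} : \|w\| = 1\}.$$

For any $(w, b) \in \Theta$, let $g_{w,b}(x) = \langle w, x \rangle - b$. Let $h$ be some fixed integer and let $\gamma = R(X_1^{(k+1)N}) \gamma_h$, where $\gamma_h$ is defined by equation (3.6) on page 142. Let us define $\zeta : \mathbb{R} \to \mathbb{Z}$ by

$$\zeta(r) = \begin{cases} -5 & \text{when } r < -4\gamma, \\ -3 & \text{when } -4\gamma \leq r < -2\gamma, \\ -1 & \text{when } -2\gamma \leq r < 0, \\ +1 & \text{when } 0 \leq r < 2\gamma, \\ +3 & \text{when } 2\gamma \leq r < 4\gamma, \\ +5 & \text{when } 4\gamma \leq r. \end{cases}$$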

Let $G_{w,b}(x) = \zeta[g_{w,b}(x)]$. The fat shattering dimension (as defined in 3.5) of

$$\bigl\{(G_{w,b} + 7)/2 : (w, b) \in \Theta\bigr\}$$

is not greater than $h$ (according to Theorem 3.15), therefore there is some set $\mathcal{F}$ of functions from $(X_i)_{i=1}^{(k+1)N}$ to $\{-5, -3, -1, +1, +3, +5\}$ such that

$$\log(|\mathcal{F}|) \leq \log\bigl[20(k+1)N\bigr] \Bigl\{\frac{h}{\log(2)} \Bigl[\log\Bigl(\frac{4(k+1)N}{h}\Bigr) + 1\Bigr] + 1\Bigr\} + \log(2),$$

and for any $(w, b) \in \Theta$, there is $f_{w,b} \in \mathcal{F}$ such that $\sup\{|f_{w,b}(X_i) - G_{w,b}(X_i)| : i = 1, \dots, (k+1)N\} \leq 2$. Moreover, the choice of $f_{w,b}$ may be required to depend on $(X_i)_{i=1}^{(k+1)N}$ in an exchangeable way. Similarly to Theorem 2.8 (page 104) it can be proved that for any partially exchangeable probability distribution $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $f_{w,b} \in \mathcal{F}$,


{k+l)N 
i=N+l 

'+\mni-exp(-A)]-'{l 

N 



< 



k AeiR,+ 
exp 



'N2 



i=l -I J 



1 ^ 



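Let us remark that

$$\mathbb{1}\bigl\{2\,\mathbb{1}[g_{w,b}(X_i) \geq 0] - 1 \neq Y_i\bigr\} = \mathbb{1}\bigl[G_{w,b}(X_i) Y_i < 0\bigr] \leq \mathbb{1}\bigl[f_{w,b}(X_i) Y_i \leq 1\bigr]$$

and

$$\mathbb{1}\bigl[f_{w,b}(X_i) Y_i \leq 1\bigr] \leq \mathbb{1}\bigl[G_{w,b}(X_i) Y_i \leq 3\bigr] \leq \mathbb{1}\bigl[g_{w,b}(X_i) Y_i \leq 4\gamma\bigr].$$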
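The chain of inequalities between indicator functions can be checked mechanically. In the sketch below, $\zeta$ quantizes $r$ into the six levels $\pm 1, \pm 3, \pm 5$ on cells of width $2\gamma$; the choice of half-open cells $[a, b)$ is an assumption, made so that $\zeta(r) > 0$ exactly when $r \geq 0$. The value of $f$ is any level within distance 2 of $\zeta(r)$, as guaranteed by the covering property, and the function names are hypothetical.

```python
import random

LEVELS = (-5, -3, -1, 1, 3, 5)


def zeta(r, gamma):
    """Quantise r on cells of width 2*gamma (half-open cells [a, b) assumed)."""
    uppers = (-4 * gamma, -2 * gamma, 0.0, 2 * gamma, 4 * gamma)
    for level, upper in zip(LEVELS, uppers):
        if r < upper:
            return level
    return 5


def chain_holds(r, y, f, gamma):
    """Check 1[error] = 1[G y < 0] <= 1[f y <= 1] <= 1[G y <= 3] <= 1[r y <= 4 gamma],
    where G = zeta(r) and r plays the role of g_{w,b}(X_i)."""
    G = zeta(r, gamma)
    misclassified = (1 if r >= 0 else -1) != y
    return (misclassified == (G * y < 0)
            and (G * y < 0) <= (f * y <= 1) <= (G * y <= 3) <= (r * y <= 4 * gamma))


rng = random.Random(3)
gamma = 0.5
all_hold = True
for _ in range(2000):
    r = rng.uniform(-6 * gamma, 6 * gamma)
    y = rng.choice((-1, 1))
    G = zeta(r, gamma)
    for f in LEVELS:
        if abs(f - G) <= 2:          # f is an admissible approximation of G
            all_hold = all_hold and chain_holds(r, y, f, gamma)
```

Comparing booleans with `<=` expresses implication, so each link of the chain says that the indicator on the left is dominated by the one on the right.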

This proves the following theorem. 

Theorem 3.19. With $\mathbb{P}$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$,

(k+l)N 

— i{2i[5.,,(x,)>o]-i^y.} 



i=N+l 

k + l 



<U-± inf [l-exp(-A)] ^ H 



N 



exp 



A 



i=l 



log[20(/c + l)Ar]{^log 



ie(k+l)N 



N 



+ U + log 



2fe(fe+l) 
1 ^



kN 
1=1 



As a consequence, we obtain with P probability at least 1 — e, for any 
{■w,b) & @ such that 

1=1,. ...Af 



{k+l)N 



kN 

i=N+l 



<^U-exp 



log[20(fc+l)Af] f i6ii2+27^ l„„/ e(fc+l)A^7^ \ , 1 
\ log{2)7^ ^°^[ m ) + ^ 



+ ^log(f) 



This inequality compares favourably with similar inequalities in jJH], which 
moreover do not extend to the margin quantile case as this one. 

Let us also remark that it is easy to circumvent the fact that $R$ is not observed when the test set $X_{N+1}^{(k+1)N}$ is not observed.

Indeed, we can consider the sample obtained by projecting $X_1^{(k+1)N}$ on some ball of fixed radius $R_{\max}$, putting



"rnax 
1^ 

We can further consider an atomic prior distribution $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$ bearing on $R_{\max}$, to obtain a uniform result through a union bound. As a consequence of the previous theorem indeed,

Corollary 3.20. For any atomic prior $\nu \in \mathcal{M}_+^1(\mathbb{R}_+)$, for any partially exchangeable probability measure $\mathbb{P} \in \mathcal{M}_+^1(\Omega)$, with $\mathbb{P}$ probability at least $1 - \epsilon$, for any $(w, b) \in \Theta$, any $R_{\max} \in \mathbb{R}_+$,
{k+l)N



J2 i{2i[5^,bot«_(x,)>o] -i^y,} 



< 



=Ar+i 
k + 1 



k AeE.+,/ieiN 

TV 



inf_ [l-exp(-A)] \l- 



exp 



i=l 



log [20{k + 1) AT] { ^ log (M^) + i} + lo, 



2h(h+l) 

e!^(Rmax) 



N 



1 ^ 

T7 ^ 1 [9w,b o tR^^{Xi)Yi < AR^^^h] ■ 



kN 



i=l 



4. Appendix: classification by thresholding 

In this appendix, we show how the bounds given in the first section of 
this monograph can be computed in practice on a simple example: the case 
when the classification is performed by comparing a series of measurements 
to threshold values. Let us mention that our description covers the case when 
the same measurement is compared to several thresholds, since it is enough 
to repeat a measurement in the list of measurements describing a pattern 
to cover this case. 

4.1. Description of the model. Let us assume that the patterns we 
want to classify are described through h real valued measurements normal- 
ized in the range (0, 1). In this setting the pattern space can thus be defined 
as X = (0,1)'^. 

Consider the threshold set T = (0, 1)'* and the response set = y^^'^^^. 
For any t G (0, 1)'' and any a : {0, 1}'' ^ y, let 

f{t,a)ix) = a| [l{x^ > tj)]^^ J, xeX, 

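where $x^j$ is the $j$th coordinate of $x \in \mathcal{X}$. Thus our parameter set here is
$\Theta = \mathcal{T} \times \mathcal{A}$. Let us consider on $\mathcal{T}$ the Lebesgue measure $L$ and on $\mathcal{A}$ the
uniform probability distribution $U$. Let our prior distribution be $\pi = L \otimes U$.
Let us define for any threshold sequence $t \in \mathcal{T}$

$$\Delta_t = \bigl\{ t' \in \mathcal{T} : (t'_j, t_j) \cap \{X_i^j ; i = 1, \dots, N\} = \varnothing, \ j = 1, \dots, h \bigr\},$$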
where $X_i^j$ is the $j$th coordinate of the sample pattern $X_i$, and where the
interval $(t'_j, t_j)$ of the real line is defined as the convex hull of the two point
set $\{t'_j, t_j\}$, whether $t'_j \leq t_j$ or not. We see that $\Delta_t$ is the set of thresholds
giving the same response as $t$ on the training patterns. Let us consider for
any $t \in \mathcal{T}$ the middle

$$m(\Delta_t) = \frac{\int_{\Delta_t} t' \, L(dt')}{L(\Delta_t)}$$

of $\Delta_t$. The set $\Delta_t$ being a product of intervals, its middle is the point whose
coordinates are the middle of these intervals. Let us introduce the finite set
$T$ composed of the middles of the cells $\Delta_t$, which can be defined as

$$T = \{ t \in \mathcal{T} : t = m(\Delta_t) \}.$$

It is easy to see that $|T| \leq (N + 1)^h$ and that $|\mathcal{A}| = |\mathcal{Y}|^{2^h}$.
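The construction of the finite set of cell middles can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions: patterns are given as tuples of coordinates in (0, 1), the response function is stored as a dictionary keyed by bit patterns, and the names `cells`, `pattern` and `classify` are ours, not the monograph's.

```python
import itertools
import math

def cells(X):
    """Middles and Lebesgue measures of the cells cut out by the sample X.

    X is a list of N patterns, each a tuple of h coordinates in (0, 1).
    """
    mids, widths = [], []
    for j in range(len(X[0])):
        # Distinct sample coordinates split (0, 1) into at most N + 1 intervals.
        cuts = [0.0] + sorted({x[j] for x in X}) + [1.0]
        mids.append([(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])])
        widths.append([hi - lo for lo, hi in zip(cuts, cuts[1:])])
    T = list(itertools.product(*mids))                      # cell middles
    L = [math.prod(w) for w in itertools.product(*widths)]  # cell volumes
    return T, L

def pattern(x, t):
    """Bit pattern c = [1(x^j >= t_j)] for j = 1, ..., h."""
    return tuple(int(xj >= tj) for xj, tj in zip(x, t))

def classify(t, a, x):
    """Thresholding rule: look up the response a(c) of the bit pattern of x."""
    return a[pattern(x, t)]

X = [(0.2, 0.5), (0.6, 0.3), (0.8, 0.9)]  # N = 3 patterns, h = 2 measurements
T, L = cells(X)
print(len(T), sum(L))  # number of cells and their total volume
```

With three patterns having distinct coordinates and two measurements, the sketch enumerates $(N+1)^h = 16$ cells whose volumes add up to one, up to rounding.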

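4.2. Computation of inductive bounds. For any parameter
$(t, a) \in \mathcal{T} \times \mathcal{A} = \Theta$, let us consider the posterior distribution $\rho_{(t,a)}$
defined by its density

$$\frac{d\rho_{(t,a)}}{d\pi}(t', a') = \frac{\mathbb{1}(t' \in \Delta_t) \, \mathbb{1}(a' = a)}{\pi(\Delta_t \times \{a\})}.$$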

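Let us notice that we are in fact considering a finite number of posterior
distributions, since $\rho_{(t,a)} = \rho_{(m(\Delta_t),a)}$, where $m(\Delta_t) \in T$. Let us also mention
that for any exchangeable sample distribution $\mathbb{P} \in \mathcal{M}_+^1[(\mathcal{X} \times \mathcal{Y})^{N+1}]$ and any
thresholds $t \in \mathcal{T}$,

$$\mathbb{P}\bigl[ (X_{N+1}^j, t_j) \cap \{X_i^j ; i = 1, \dots, N\} = \varnothing \bigr] \leq \frac{2}{N+1}, \qquad j = 1, \dots, h,$$

since this interval is free of training points only when $X_{N+1}^j$ is the closest
to $t_j$, on one side or the other, among $N + 1$ exchangeable values.
Thus, for any $(t, a) \in \Theta$,

$$\mathbb{P}\bigl\{ \rho_{(t,a)}[f_{\cdot}(X_{N+1})] \neq f_{(t,a)}(X_{N+1}) \bigr\} \leq \frac{2h}{N+1},$$

showing that the classification produced by $\rho_{(t,a)}$ on new examples is most
of the time non random (this result is only indicative, since it is concerned
with a non random choice of $(t, a)$).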

Let us then compute the various quantities needed to apply the results of
the first section, focusing our attention on Theorem 1.39 (page 53):

It is to be noted first of all that $\rho_{(t,a)}(r) = r[(t, a)]$. The entropy term is
such that

$$\mathcal{K}(\rho_{(t,a)}, \pi) = -\log[\pi(\Delta_t \times \{a\})] = -\log[L(\Delta_t)] + 2^h \log(|\mathcal{Y}|).$$
Let us notice accordingly that

$$\min_{(t,a) \in \Theta} \mathcal{K}(\rho_{(t,a)}, \pi) \leq h \log(N + 1) + 2^h \log(|\mathcal{Y}|).$$
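The entropy term only requires the side lengths of the cell containing the chosen thresholds. The following sketch is an illustration under our own conventions: patterns are tuples in (0, 1), and the helper names `cell_bounds` and `entropy_term` are hypothetical, not taken from the monograph.

```python
import math

def cell_bounds(t, X):
    """Interval (lo_j, hi_j) of the cell containing t along each coordinate j."""
    bounds = []
    for j, tj in enumerate(t):
        lo = max((x[j] for x in X if x[j] < tj), default=0.0)
        hi = min((x[j] for x in X if x[j] >= tj), default=1.0)
        bounds.append((lo, hi))
    return bounds

def entropy_term(t, X, n_labels):
    """Minus the log-volume of the cell of t, plus 2^h log|Y| for the response function."""
    h = len(t)
    log_volume = sum(math.log(hi - lo) for lo, hi in cell_bounds(t, X))
    return -log_volume + 2 ** h * math.log(n_labels)

X = [(0.2, 0.5), (0.6, 0.3), (0.8, 0.9)]
value = entropy_term((0.4, 0.7), X, n_labels=2)
upper = 2 * math.log(len(X) + 1) + 2 ** 2 * math.log(2)
print(value, upper)  # entropy of one large cell against the uniform upper bound
```

The thresholds (0.4, 0.7) were picked by hand inside one of the largest cells of this toy sample, so the computed value falls below the bound $h \log(N+1) + 2^h \log(|\mathcal{Y}|)$.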



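Let us introduce the counters

$$b_y^t(c) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl\{ Y_i = y \text{ and } [\mathbb{1}(X_i^j \geq t_j)]_{j=1}^h = c \bigr\}, \qquad t \in \mathcal{T}, \ c \in \{0, 1\}^h, \ y \in \mathcal{Y},$$

$$b^t(c) = \sum_{y \in \mathcal{Y}} b_y^t(c) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\bigl\{ [\mathbb{1}(X_i^j \geq t_j)]_{j=1}^h = c \bigr\}, \qquad t \in \mathcal{T}, \ c \in \{0, 1\}^h.$$

Since

$$r[(t, a)] = \sum_{c \in \{0,1\}^h} \bigl[ b^t(c) - b^t_{a(c)}(c) \bigr],$$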

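the partition function of the Gibbs estimator can be computed as

$$\pi[\exp(-\lambda r)] = \sum_{t \in T} L(\Delta_t) \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \exp\Bigl\{ -\lambda \sum_{c \in \{0,1\}^h} \bigl[ b^t(c) - b^t_{a(c)}(c) \bigr] \Bigr\}$$

$$= \sum_{t \in T} L(\Delta_t) \prod_{c \in \{0,1\}^h} \Bigl[ \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \exp\bigl( -\lambda [ b^t(c) - b_y^t(c) ] \bigr) \Bigr].$$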



We see that the number of operations needed to compute $\pi[\exp(-\lambda r)]$ is
proportional to $|T| \times 2^h \times |\mathcal{Y}| \leq (N + 1)^h 2^h |\mathcal{Y}|$. An exact computation will
therefore be feasible only for small values of $N$ and $h$. For higher values, a
Monte Carlo approximation of this sum will have to be performed instead.
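Both the exact sum over cells and a Monte Carlo approximation can be sketched as follows. This is an illustrative sketch, not the author's implementation: the function names are ours, and the Monte Carlo variant simply assumes that thresholds are drawn independently from the Lebesgue measure on the unit cube, which gives an unbiased estimate because the product over bit patterns depends on the thresholds only through their cell.

```python
import itertools
import math
import random

def pattern(x, t):
    return tuple(int(xj >= tj) for xj, tj in zip(x, t))

def cells(X):
    """Middles and Lebesgue measures of the cells cut out by the sample X."""
    mids, widths = [], []
    for j in range(len(X[0])):
        cuts = [0.0] + sorted({x[j] for x in X}) + [1.0]
        mids.append([(lo + hi) / 2 for lo, hi in zip(cuts, cuts[1:])])
        widths.append([hi - lo for lo, hi in zip(cuts, cuts[1:])])
    return (list(itertools.product(*mids)),
            [math.prod(w) for w in itertools.product(*widths)])

def cell_factor(lam, t, X, Y, labels):
    """Product over bit patterns c of (1/|Y|) sum_y exp(-lam [b(c) - b_y(c)])."""
    counts = {}
    for x, y in zip(X, Y):
        by = counts.setdefault(pattern(x, t), dict.fromkeys(labels, 0.0))
        by[y] += 1.0 / len(X)
    factor = 1.0  # patterns hit by no training point contribute a factor 1
    for by in counts.values():
        bc = sum(by.values())
        factor *= sum(math.exp(-lam * (bc - by[y])) for y in labels) / len(labels)
    return factor

def partition_exact(lam, X, Y, labels):
    T, L = cells(X)
    return sum(vol * cell_factor(lam, t, X, Y, labels) for t, vol in zip(T, L))

def partition_monte_carlo(lam, X, Y, labels, n_draws, rng):
    h = len(X[0])
    draws = (tuple(rng.random() for _ in range(h)) for _ in range(n_draws))
    return sum(cell_factor(lam, t, X, Y, labels) for t in draws) / n_draws

X = [(0.2, 0.7), (0.5, 0.1), (0.9, 0.4)]
Y = [0, 1, 1]
print(partition_exact(2.5, X, Y, [0, 1]))
print(partition_monte_carlo(2.5, X, Y, [0, 1], 20000, random.Random(0)))
```

Only the bit patterns actually hit by training points are visited, so each cell costs a number of operations of order $N |\mathcal{Y}|$ rather than $2^h |\mathcal{Y}|$; the number of cells remains the bottleneck.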

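If we want to compute the bound provided by Theorem 1.39 (page 53),
we need also to compute, for any fixed parameter $\theta \in \Theta$, quantities of the
type

$$\pi_{\exp(-\lambda r)}\bigl\{ \exp[\xi m'(\cdot, \theta)] \bigr\} = \pi_{\exp(-\lambda r)}\bigl\{ \exp[\xi \rho_\theta(m')] \bigr\}, \qquad \lambda, \xi \in \mathbb{R}_+.$$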
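To this purpose we need to introduce

$$\overline{b}_y^t(\theta, c) = \frac{1}{N} \sum_{i=1}^N \bigl| \mathbb{1}[f_\theta(X_i) \neq Y_i] - \mathbb{1}(y \neq Y_i) \bigr| \, \mathbb{1}\bigl\{ [\mathbb{1}(X_i^j \geq t_j)]_{j=1}^h = c \bigr\}.$$

Similarly to what has been done previously, we obtain

$$\pi\bigl\{ \exp[-\lambda r + \xi m'(\cdot, \theta)] \bigr\} = \sum_{t \in T} L(\Delta_t) \prod_{c \in \{0,1\}^h} \Bigl[ \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \exp\bigl( -\lambda [ b^t(c) - b_y^t(c) ] + \xi \, \overline{b}_y^t(\theta, c) \bigr) \Bigr].$$

We can then compute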



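$$\pi_{\exp(-\lambda r)}(r) = -\frac{\partial}{\partial \lambda} \log\bigl\{ \pi[\exp(-\lambda r)] \bigr\},$$

$$\pi_{\exp(-\lambda r)}\bigl\{ \exp[\xi \rho_\theta(m')] \bigr\} = \frac{\pi\bigl\{ \exp[-\lambda r + \xi m'(\cdot, \theta)] \bigr\}}{\pi[\exp(-\lambda r)]},$$

$$\pi_{\exp(-\lambda r)}[m'(\cdot, \theta)] = \frac{\partial}{\partial \xi}\Big|_{\xi = 0} \log \pi\bigl\{ \exp[-\lambda r + \xi m'(\cdot, \theta)] \bigr\}.$$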

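This is all we need to compute $B(\rho_\theta, \beta, \gamma)$ (and also $B(\pi_{\exp(-\lambda r)}, \beta, \gamma)$) in
Theorem 1.39 (page 53), using the approximation

$$\log\Bigl\{ \pi_{\exp(-\lambda_1 r)} \exp\bigl\{ \xi \pi_{\exp(-\lambda_2 r)}(m') \bigr\} \Bigr\}
\leq \log\Bigl\{ \pi_{\exp(-\lambda_1 r)} \exp\bigl\{ \xi m'(\cdot, \theta) \bigr\} \Bigr\} + \xi \pi_{\exp(-\lambda_2 r)}[m'(\cdot, \theta)], \qquad \xi \geq 0.$$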

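Let us also explain how to apply the posterior distribution $\rho_{(t,a)}$, in other
words our randomized estimated classification rule, to a new pattern $X_{N+1}$:

$$\rho_{(t,a)}[f_{\cdot}(X_{N+1}) = y] = L(\Delta_t)^{-1} \int_{\Delta_t} \mathbb{1}\Bigl[ a\bigl\{ [\mathbb{1}(X_{N+1}^j \geq t'_j)]_{j=1}^h \bigr\} = y \Bigr] L(dt').$$

Let us define for short

$$\Delta_t(c) = \bigl\{ t' \in \Delta_t : [\mathbb{1}(X_{N+1}^j \geq t'_j)]_{j=1}^h = c \bigr\}, \qquad c \in \{0, 1\}^h.$$

With this notation

$$\rho_{(t,a)}[f_{\cdot}(X_{N+1}) = y] = L(\Delta_t)^{-1} \sum_{c \in \{0,1\}^h} L[\Delta_t(c)] \, \mathbb{1}[a(c) = y].$$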
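The label probabilities of the randomized rule on a new pattern reduce to volumes of sub-boxes of the cell. A minimal sketch, assuming patterns are tuples in (0, 1) and the response function is a dictionary keyed by bit patterns (the names `cell_bounds` and `predictive` are ours):

```python
import itertools
import math

def cell_bounds(t, X):
    """Interval (lo_j, hi_j) of the cell containing t along each coordinate j."""
    bounds = []
    for j, tj in enumerate(t):
        lo = max((x[j] for x in X if x[j] < tj), default=0.0)
        hi = min((x[j] for x in X if x[j] >= tj), default=1.0)
        bounds.append((lo, hi))
    return bounds

def predictive(t, a, X, x_new, labels):
    """Probability of each label for x_new when thresholds are uniform on the cell of t."""
    bounds = cell_bounds(t, X)
    volume = math.prod(hi - lo for lo, hi in bounds)
    probs = dict.fromkeys(labels, 0.0)
    for c in itertools.product([0, 1], repeat=len(t)):
        volume_c = 1.0
        for cj, (lo, hi), xj in zip(c, bounds, x_new):
            # c_j = 1 means x_new^j >= t'_j, i.e. t'_j lies below x_new^j in the cell.
            length = min(hi, xj) - lo if cj else hi - max(lo, xj)
            volume_c *= max(length, 0.0)
        probs[a[c]] += volume_c / volume
    return probs

X = [(0.2,), (0.8,)]                  # two training patterns, h = 1
a = {(0,): "neg", (1,): "pos"}        # response for each bit pattern
print(predictive((0.5,), a, X, (0.35,), ["neg", "pos"]))
print(predictive((0.5,), a, X, (0.9,), ["neg", "pos"]))
```

In the first call the new pattern falls inside the cell (0.2, 0.8), so the answer is genuinely random; in the second it falls outside and the classification is deterministic, in line with the earlier remark that the rule is non random most of the time.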

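We can compute in the same way the probabilities for the label of the new
pattern under the Gibbs posterior distribution:

$$\pi_{\exp(-\lambda r)}[f_{\cdot}(X_{N+1}) = y'] = \frac{1}{\pi[\exp(-\lambda r)]} \sum_{t \in T} \Biggl\{ \sum_{c \in \{0,1\}^h} L[\Delta_t(c)] \, \frac{\sum_{y \in \mathcal{Y}} \mathbb{1}(y = y') \exp\{-\lambda[b^t(c) - b_y^t(c)]\}}{\sum_{y \in \mathcal{Y}} \exp\{-\lambda[b^t(c) - b_y^t(c)]\}} \Biggr\} \prod_{c \in \{0,1\}^h} \Bigl[ \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \exp\bigl(-\lambda[b^t(c) - b_y^t(c)]\bigr) \Bigr].$$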



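4.3. Transductive bounds. In the case when we observe the patterns
of a shadow sample $(X_i)_{i=N+1}^{(k+1)N}$ on top of the training sample $(X_i, Y_i)_{i=1}^N$, we
can introduce the set of thresholds responding as $t$ on the extended sample
$(X_i)_{i=1}^{(k+1)N}$,

$$\overline{\Delta}_t = \bigl\{ t' \in \mathcal{T} : (t'_j, t_j) \cap \{X_i^j ; i = 1, \dots, (k+1)N\} = \varnothing, \ j = 1, \dots, h \bigr\},$$

consider the set

$$\overline{T} = \{ t \in \mathcal{T} : t = m(\overline{\Delta}_t) \},$$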

of the middle points of the cells $\overline{\Delta}_t$, $t \in \mathcal{T}$, and replace the Lebesgue measure
$L \in \mathcal{M}_+^1[(0, 1)^h]$ of the previous section with the uniform probability
measure $\overline{L}$ on $\overline{T}$. We can then consider $\overline{\pi} = \overline{L} \otimes U$, where $U$ is as
previously the uniform probability measure on $\mathcal{A}$. This obviously gives an
exchangeable posterior distribution and therefore qualifies $\overline{\pi}$ for transductive
bounds. Let us notice that $|\overline{T}| \leq [(k + 1)N + 1]^h$, and therefore that

$$\overline{\pi}(t, a) \geq [(k + 1)N + 1]^{-h} \, |\mathcal{Y}|^{-2^h}, \qquad \text{for any } (t, a) \in \overline{T} \times \mathcal{A}.$$

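For any $(t, a) \in \mathcal{T} \times \mathcal{A}$ we may similarly to the inductive case consider the
posterior distribution $\rho_{(t,a)}$ defined by

$$\frac{d\rho_{(t,a)}}{d\overline{\pi}}(t', a') = \frac{\mathbb{1}(t' \in \Delta_t) \, \mathbb{1}(a' = a)}{\overline{\pi}(\Delta_t \times \{a\})},$$

but we may also consider $\delta_{(m(\overline{\Delta}_t), a)}$, which is such that
$r_i\{[m(\overline{\Delta}_t), a]\} = r_i[(t, a)]$, $i = 1, 2$, whereas only $\rho_{(t,a)}(r_1) = r_1[(t, a)]$, while

$$\rho_{(t,a)}(r_2) = \frac{1}{|\overline{T} \cap \Delta_t|} \sum_{t' \in \overline{T} \cap \Delta_t} r_2[(t', a)].$$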
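We get

$$\mathcal{K}(\rho_{(t,a)}, \overline{\pi}) = -\log[\overline{L}(\Delta_t)] + 2^h \log(|\mathcal{Y}|)$$

$$\leq \log(|\overline{T}|) + 2^h \log(|\mathcal{Y}|) = \mathcal{K}(\delta_{[m(\overline{\Delta}_t), a]}, \overline{\pi})$$

$$\leq h \log[(k + 1)N + 1] + 2^h \log(|\mathcal{Y}|),$$

whereas we had no such uniform bound in the inductive case. Similarly to
the inductive case

$$\overline{\pi}[\exp(-\lambda r_1)] = \sum_{t \in T} \overline{L}(\Delta_t) \prod_{c \in \{0,1\}^h} \Bigl[ \frac{1}{|\mathcal{Y}|} \sum_{y \in \mathcal{Y}} \exp\bigl( -\lambda [ b^t(c) - b_y^t(c) ] \bigr) \Bigr].$$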

Moreover, for any $\theta \in \Theta$,

$$\overline{\pi}\bigl\{ \exp[-\lambda r_1 + \xi \rho_\theta(m')] \bigr\} = \overline{\pi}\bigl\{ \exp[-\lambda r_1 + \xi m'(\cdot, \theta)] \bigr\}.$$

The bound for the transductive counterpart to Theorem 1.39 (page 53),
obtained as explained earlier, can be computed as in the inductive case,
from these two partition functions and the above entropy estimates.

Let us finally mention that, using the same notation as in the inductive
case,



= (e n \^\E--^(-mc)-biic 

ueTce{o,i}'^ yey 



E,eyl(y = y0exp{-A[6*(c)-6*(c)]} 



E,.,exp{-A[6*(x)-6*(c)]} 